PACKET PROCESSING SYSTEM
INCLUDING A POLICY ENGINE HAVING A
CLASSIFICATION UNIT

FIELD OF THE INVENTION 

The present invention relates to computer networks and, 
more particularly, to a general purpose programmable plat- 
form for acceleration of network infrastructure applications. 

BACKGROUND OF THE INVENTION 

Computer networks have become a key part of the corporate
infrastructure. Organizations have become increasingly
dependent on intranets and the Internet and are demanding
much greater levels of performance from their network
infrastructure. The network infrastructure is being viewed:
(1) as a competitive advantage; (2) as mission critical; and
(3) as a cost center. The infrastructure itself is
transitioning from 10 Mb/s (megabits per second) capability
to 100 Mb/s capability. Soon, infrastructure capable of 1
Gb/s (gigabits per second) will start appearing on server
connections, trunks and backbones. As more and more
computing equipment gets deployed, the number of nodes
within an organization has also grown. There has been a
doubling of users, and a ten-fold increase in the amount of
traffic every year.

Network infrastructure applications monitor, manage and 
manipulate network traffic in the fabric of computer net- 
works. The high demand for network bandwidth and con- 
nectivity has led to tremendous complexity and performance 
requirements for this class of application. Traditional methods
of dealing with these problems are no longer adequate.

Several sophisticated software applications that provide 
solutions to the problems encountered by the network man- 
ager have emerged. The main areas for such applications are 
Security, Quality of Service (QoS)/Class of Service (CoS)
and Network Management. Examples are: Firewalls; Intrusion
Detection; Encryption; Virtual Private Networks
(VPN); enabling services for ISPs (load balancing and
such); Accounting; Web billing; Bandwidth Optimization;
Service Level Management; Commerce; Application Level
Management; Active Network Management.

There are three conventional ways in which these appli- 
cations are deployed: 

(1) On general purpose computers.

(2) Using single function boxes. 

(3) On switches and routers. 

It is instructive to examine the issues related to each of 
these deployment techniques. 

1. General Purpose Computers.

General purpose computers, such as PCs running
NT/Windows or workstations running Solaris/HP-UX, etc.
are a common method for deploying network infrastructure
applications. The typical configuration consists of two or
more network interfaces each providing a connection to a
network segment. The application runs on the main processor
(Pentium/SPARC etc.) and communicates with the Network
Interface Controller (NIC) card either through
(typically) the socket interface or (in some cases) a specialized
driver "shim" in the operating system (OS). The "shim"
approach allows access to "raw" packets, which is necessary
for many of the packet oriented applications. Applications
that are end-point oriented, such as proxies, can interface to
the top of the IP (Internet Protocol) or other protocol stack.

The advantages of running the application on a general 
purpose computer include: a full development environment; 
all the OS services (IPC, file system, memory management,
threads, I/O etc.); low cost due to ubiquity of the platform;
stability of the APIs; and assurance that performance will 
increase with each new generation of the general purpose 
computer technology. 

There are, however, many disadvantages of running the 
application on a general purpose computer. First, the I/O 
subsystem on a general purpose computer is optimized to 
provide a standard connection to a variety of peripherals at 
reasonable cost and, hence, reasonable performance. 32-bit/33
MHz PCI ("Peripheral Component Interconnect", the dominant
I/O connection on common general purpose platforms
today) has an effective bandwidth in the 50-75 MB/s range.
While this is adequate for a few interfaces to high perfor- 
mance networks, it does not scale. Also, there is significant 
latency involved in accesses to the card. Therefore, any kind 
of non-pipelined activity results in a significant performance 
impact. 
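
The scaling limit can be made concrete with a rough calculation. The sketch below is illustrative only: the 32-bit/33 MHz bus and the 50-75 MB/s effective range come from the text, while the assumption that every packet byte crosses the bus twice (once inbound to memory, once outbound to a NIC) and the function names are introduced here for the example.

```python
# Back-of-the-envelope PCI budget (illustrative assumptions, see lead-in).
BUS_WIDTH_BITS = 32
BUS_CLOCK_HZ = 33_000_000


def peak_bandwidth_mb_s(width_bits: int, clock_hz: int) -> float:
    """Theoretical peak in MB/s: one bus-width transfer per clock cycle."""
    return width_bits / 8 * clock_hz / 1_000_000


def interfaces_supported(effective_mb_s: float, link_mbit_s: float,
                         crossings: int = 2) -> int:
    """Number of links that can run at full rate in one direction.

    `crossings` is the assumed number of times each byte traverses the bus.
    """
    link_mbyte_s = link_mbit_s / 8
    return int(effective_mb_s // (link_mbyte_s * crossings))


if __name__ == "__main__":
    print("peak MB/s:", peak_bandwidth_mb_s(BUS_WIDTH_BITS, BUS_CLOCK_HZ))
    for effective in (50, 75):
        for link in (100, 1000):
            n = interfaces_supported(effective, link)
            print(f"{effective} MB/s effective, {link} Mb/s links: {n}")
```

Under these assumptions a handful of 100 Mb/s links fit within the effective bandwidth, but not even one 1 Gb/s link does.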

Another disadvantage is that general purpose computers 
do not typically have good interrupt response time and 
context switch characteristics (as opposed to real-time oper- 
ating systems used in many embedded applications). While 
this is not a problem for most computing environments, it is 
far from ideal for a network infrastructure application. 
Network infrastructure applications have to deal with net- 
work traffic operating at increasingly higher speeds and less 
time between packets. Small interrupt response times and
small context switch times are essential.
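
To give a sense of how little time is available per packet, the following sketch computes the worst-case (minimum-size frame) packet spacing on Ethernet at the link speeds mentioned earlier. The frame overhead constants are the standard Ethernet values (64-byte minimum frame, 8 bytes of preamble and start-of-frame delimiter, 12-byte interframe gap); the function names are hypothetical.

```python
# Worst-case packet arrival spacing on Ethernet for minimum-size frames.
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
MIN_FRAME_BYTES = 64        # minimum Ethernet frame, including FCS
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames

WIRE_BITS_PER_MIN_FRAME = (
    PREAMBLE_BYTES + MIN_FRAME_BYTES + INTERFRAME_GAP_BYTES
) * 8


def min_packet_time_us(link_bits_per_s: float) -> float:
    """Time between back-to-back minimum-size frames, in microseconds."""
    return WIRE_BITS_PER_MIN_FRAME / link_bits_per_s * 1e6


def packets_per_second(link_bits_per_s: float) -> float:
    """Maximum packet rate for minimum-size frames."""
    return link_bits_per_s / WIRE_BITS_PER_MIN_FRAME


if __name__ == "__main__":
    for name, rate in (("10 Mb/s", 10e6), ("100 Mb/s", 100e6), ("1 Gb/s", 1e9)):
        print(f"{name}: {min_packet_time_us(rate):.3f} us per packet, "
              f"{packets_per_second(rate):,.0f} packets/s")
```

Any interrupt latency or context switch that is comparable to these per-packet times directly reduces the packet rate the platform can sustain.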

Another disadvantage is that general purpose platforms do 
not have any specialized hardware that assists with network
infrastructure applications. With rare exception, none of the 
instruction sets for general purpose computers are optimized 
for network infrastructure applications. 

Another disadvantage is that, on a general purpose 
computer, typical network applications are built on top of the 
TCP/IP stack. This severely limits the packet processing 
capability of the application. 

Another disadvantage is that packets need to be pulled 
into the processor cache for processing. Cache fills and write 
backs become a severe bottleneck for high bandwidth net- 
works. 

Finally, general purpose platforms use general purpose 
operating systems (OS's). These operating systems are gen- 
erally not known for having quick reboots on power-cycle or 
other wiring-closet appliance oriented characteristics impor- 
tant for network infrastructure applications. 

2. Fixed-Function Appliances. 

There are a couple of different ways to build single
function appliances. The first way is to take a single board 
computer, add in a couple of NIC cards, and run an executive 
program on the main processor. This approach avoids some 
of the problems that a general purpose OS brings, but the 
performance is still limited to that of the base platform 
architecture (as described above). 

A way to enhance the performance is to build special 
purpose hardware that performs functions required by the 
specific application very well. Therefore, from a perfor- 
mance standpoint, this can be a very good approach. 

There are, however, a couple of key issues with special 
function appliances. For example, they are not expandable 
by their very nature. If the network manager needs a new 
application, he/she will need to procure a new appliance. 
Contrast this with loading a new application on a desktop 
PC. In the case of a PC, a new appliance is not needed with 
every new application. 
Finally, if the solution is not completely custom, it is
unlikely that the solution is scalable. Using a PC or other
single board computer as the packet processor for each
location at which that application is installed is not
cost-effective.

3. Switches and Routers. 

Another approach is to deploy a scaled down version of 
an application on switches and routers which comprise the 
fabric of the network. The advantages of this approach are 
that: (1) no additional equipment is required for the deploy- 
ment of the application; and (2) all of the segments in a 
network are visible at the switches. 

There are a number of problems with this approach.

One disadvantage is that the processing power available at 
a switch or router is limited. Typically, this processing power 
is dedicated to the primary business of the switch/router — 
switching or routing. When significant applications have to 
be run on these switches or routers, their performance drops.

Another disadvantage is that not all nodes in a network 
need to be managed in the same way. Putting significant 
processing power on all the ports of a switch or router is not 
cost-effective. 

Another disadvantage is that, even if processing power
becomes so cheap as to be deployed freely at every port of a
switch or router, a switch or router is optimized to move
frames/packets from port to port. It is not optimized to
process packets for applications.

Another disadvantage is that a typical switch or router 
does not provide the facilities that are necessary for the 
creation and deployment of sophisticated network infra- 
structure applications. The services required can be quite 
extensive and porting an application to run on a switch or 
router can be very difficult. 

Finally, replacing existing network switching equipment 
with new versions that support new applications can be 
difficult. It is much more effective to "add applications" to 
the network where needed. 

What is needed is an optimized platform for the deploy- 
ment of sophisticated software applications in a network 
environment. 

SUMMARY 

The present invention relates to a general-purpose pro- 
grammable packet processing platform for accelerating net- 
work infrastructure applications which have been structured 
so as to separate the stages of classification and action. A
wide variety of embodiments of the present invention are
possible and will be understood by those skilled in the art 
based on the present patent application. In certain 
embodiments, acceleration is achieved by one or more of the 
following: 

Dividing the steps of packet processing into a multiplicity
of pipeline stages and providing different functional
units for different stages, thus allowing more processing
time per packet and also providing concurrency in
the processing of multiple packets,

Providing custom, specialized Classification Engines
which are micro-programmed processors optimized for
the various functions common in predicate analysis and
table searches for these sorts of applications and are
each used as pipeline stages in different paths,

Providing a general-purpose microprocessor for executing
the arbitrary actions desired by these applications,

Providing a tightly-coupled encryption coprocessor to
accelerate common network encryption functions,

Reducing or eliminating the need for the applications to
examine the actual contents of the packet, thus minimizing
the movement of packet data and the effects of
that data movement on the processor's cache/bus/memory
subsystem, and

Either eliminating or providing special hardware to accelerate
system overheads common to embedded network
applications run on general purpose platforms; this
includes special support for managing buffer pools, for
communication among units and the passing of buffers
between them, and for managing the network interface
MACs (media access controllers) without the need for
heavyweight device driver programs.

Recognizing a common policy enforcement module for
network infrastructure applications.

Certain specific embodiments are implemented with one 
or more of the following features: 

a policy enforcement module consisting of Classification 
and associated Action 

both stateless classification and stateful classification 
which uses sets 

Provision of a high level interface to packet level Clas- 
sification and Action (Action and Classification 
Engine— ACE) 

Provision of the high level interface within common 
operating environments 

Policy can be changed dynamically 

Application partitioned into an AP module running on the 
AP (Application Processor) and a PE (Policy Engine) 
module running on the PE. 

AP can run operating systems with full services to facilitate
application development

PE functionality embodied as software running on AP as 
well as hardware and software running on the hardware 
PE 

A language interface to describe Classification and to
associate Actions with the results of the Classification

Language (NetBoost Classification Language-NCL) for
Classification/Action

Object oriented (extensible) 

Specific to Classification and hence very simple 

Built-in intrinsics such as checksum 

Language constructs make it easy to describe layered 
protocols and protocol fields 

Rule construct to associate Classification and Actions 

Predicate construct which is a function of packet con- 
tents at any layer of any protocol and/or of hash 
search results 

Set construct to describe hash tables and multiple
searches on the same hash table

Action code 

Written in high level language 

Complex packet processing possible 

Can avail of Application Services Library (ASL) pro- 
viding services useful for packet processing 

ASL consists of packet management, memory 
management, time and event management, link level 
services, packet timestamp service, cryptographic 
services, communication services to AP module plus 
extensions 

TCP/IP extensions include services such as Network 
Address Translation (NAT) for IP, TCP and UDP, 
Checksums, IP fragment reassembly and TCP seg- 
ment reassembly 
System components include

library implementing API (DLL under Windows NT)
a management process called Resolver
an incremental compiler for NCL
linker for NCL code
dynamic linker for action code
operating-system specific drivers which communicate
with both hardware and software PEs
software Policy Engine that executes Classification and
Action code
ASL for Action code
management services (Resolver and Plumber) for both
application developer and the end-user
development environment for AP and PE code including
compilers, and other software development tools
familiar to those skilled in the art

ACE

C++ object which abstracts the packet processing associated
with an application or sub-application
Provides a context for Classification and Action
Contains one or more Target objects, including drop
and default, which represent packet destinations
Provides a context for upcalls and downcalls between
the AP and PE modules
Targets of an ACE are connected to other ACEs or
interfaces using the Plumber (graphical and programmatic
interfaces) to specify the serialization of
ACE processing

Operating environment for action code

Invokes actions automatically when associated classification
succeeds
Implements an ACE context
Low overhead (soft real-time) environment
Handles communication between AP and PE
Performs dynamic linking of action code when ACEs
are loaded with new Classification code

Resolver

Maintains namespace of applications, interfaces and
ACEs
Maps ACEs to PEs automatically
Contains the compiler for NCL and does dynamic
compilation of NCL
Provides the interfaces for management of applications,
ACEs and interfaces

Compiler for NCL

Generates code for multiple processors (AP and PE)
Allows incremental compilation of rules

Plumber

Allows interconnection of ACEs
Allows binding to interfaces
Supports secure remote access

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system in accordance with
the present invention.

FIG. 2 is a block diagram showing packet flow according
to an embodiment of the present invention.

FIG. 3 is a Policy Engine ASIC block diagram according
to the present invention.

FIG. 4 is a sample system-level block diagram related to
the present invention.

FIG. 5 shows a ring array in memory related to the present
invention.

FIG. 6 shows an RX Ring Structure related to the present
invention.

FIG. 7 shows a receive buffer format related to the present
invention.

FIG. 8 shows a TX Ring Structure related to the present
invention.

FIG. 9 shows a transmit buffer format related to the
present invention.

FIG. 10 shows a reclassify ring structure related to the
present invention.

FIG. 11 shows a Crypto Ring and COM[4:0] Rings
Structure related to the present invention.

FIG. 12 shows a DMA Ring Structure related to the
present invention.

FIG. 13 is a classification engine block diagram related to
the present invention.

FIG. 14 is a pipeline timing diagram for the classification
engine related to the present invention.

FIG. 15 is an application structure diagram related to the
present invention.

FIG. 16 is a diagram showing an Action Classification
Engine (ACE) related to the present invention.

FIG. 17 shows a cascade of ACEs related to the present
invention.

FIG. 18 shows a system architecture related to the present
invention.

FIG. 19 shows an application deploying six ACEs related
to the present invention.

DETAILED DESCRIPTION OF PREFERRED
EMBODIMENTS

Network infrastructure applications generally contain
both time-critical and non-time-critical sections. The
non-time-critical sections generally deal with setup,
configuration, user interface and policy management. The
time-critical sections generally deal with policy enforcement.
The policy enforcement piece generally has to run at
network speeds. The present invention pertains to an efficient
architecture for policy enforcement that enables application
of complex policy at network rates.

FIG. 1 shows a Network Infrastructure Application, called
Application 2, being deployed on an Application Processor
(AP) 4 running a standard operating system. The policy
enforcement section of the Application 2, called Wire Speed
Policy 3, runs on the Policy Engine (PE) 6. The Policy
Engine 6 transforms the inbound Packet Stream 8 into the
outbound Packet Stream 10 per the Wire Speed Policy 3.
Communications from the Application Processor 4 to the
Policy Engine 6, in addition to the Wire Speed Policy 3,
consists of control, policy modifications and packet data as
desired by the Application 2. Communications from the
Policy Engine 6 to the Application Processor 4 consists of
status, exception conditions and packet data as described by
the Application 2.

In a preferred embodiment of a Policy Engine (PE)
according to the present invention, the PE provides a highly
programmable platform for classifying network packets and
implementing policy decisions about those packets at wire
speed. Certain embodiments provide two Fast Ethernet ports
and implement a pipelined dataflow architecture with
store-and-forward. Packets are run through a Classification
Engine (CE) which executes a programmed series of hardware
assist operations such as chained field comparisons and
generation of checksums and hash table pointers, then are
handed to a microprocessor ("Policy Processor" or PP) for
execution of policy decisions such as Pass, Drop, Enqueue/
Delay, (de/en)capsulate, and (de/en)crypt based on the
results from the CE. Some packets which require higher
level processing may be sent to the host computer system
("Application Processor" or AP). (See FIG. 4.) An optional
cryptographic ("Crypto") Processor is provided for accelerating
such functions as encryption and key management.
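
The separation between classification (performed by the CE) and action (performed by the PP) can be illustrated with a small software model. This is a sketch only, not the hardware or the NCL interface: the `Packet`, `classify` and `act` names, the two predicates, and the example policy are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Disposition(Enum):
    PASS = auto()
    DROP = auto()
    TO_AP = auto()  # hand to the Application Processor for higher-level work


@dataclass
class Packet:
    data: bytes
    results: dict = field(default_factory=dict)  # filled in by classification


def classify(pkt: Packet) -> Packet:
    """Classification stage: inspect header fields once, record the results."""
    # Hypothetical predicates on an Ethernet frame carrying IPv4:
    # EtherType at bytes 12-13, IP protocol number at byte 23.
    is_ipv4 = pkt.data[12:14] == b"\x08\x00"
    pkt.results["is_ipv4"] = is_ipv4
    pkt.results["is_tcp"] = is_ipv4 and len(pkt.data) > 23 and pkt.data[23] == 6
    return pkt


def act(pkt: Packet) -> Disposition:
    """Action stage: decide from classification results, not packet contents."""
    if not pkt.results["is_ipv4"]:
        return Disposition.TO_AP
    return Disposition.PASS if pkt.results["is_tcp"] else Disposition.DROP


def process(frame: bytes) -> Disposition:
    """Run one frame through both stages in order."""
    return act(classify(Packet(frame)))
```

Because `act` reads only the recorded results, the action code never needs to touch the packet body, which mirrors the goal of minimizing packet data movement stated in the summary.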

Third-party applications such as firewalls, rate shaping, 
QoS/CoS, network management and others can be imple- 
mented to take advantage of this three-tiered approach to 
filtering packets. Support for easy encapsulation without 
copies combined with encryption support allows for VPNs 
("Virtual Private Networks") and other applications that 
require security services. 

A large parity-protected synchronous DRAM (SDRAM)
buffer memory is provided, along with a PCI interface that
is used for communication with the host (AP) and poten- 
tially for peer-to-peer communication among Policy 
Engines, e.g. for applications which route and switch. 

In certain embodiments the Policy Engine ASIC can be 
used on a PCI card both for application software develop- 
ment and for use in a PC or workstation as a two interface 
product, and can also be used in a multiple-segment appli- 
ance with a plurality of PE's along with an embedded 
Application Processor for a stand-alone product. 

In certain embodiments, when used in an appliance, the
PEs reside on PCI segments connected together through a
plurality of PCI-to-PCI bridges which connect to the host
PCI bus on the Application Processor. The PCI bus is 64-bit
for all agents in order to provide sufficient bandwidth for
applications which route or switch.

A sample system level block diagram is shown in FIG. 4. 

FIG. 4 shows an application processor 302 which contains 
a host interface 304 to a PCI bus 324. Fanout of the PCI bus 
324 to a larger number of loads is accomplished with 
PCI-to-PCI Bridge devices 306, 308, 310, and 312; each of 
those controls an isolated segment on a "child" PCI bus 326, 
328, 330, and 332 respectively. On three of these isolated
segments 326, 328, and 330 are a number of Policy Engines
322; each Policy Engine 322 connects to two Ethernet ports
320 which connect the Policy Engine 322 to a network
segment.

One of the PCI-to-PCI Bridges 312 controls child PCI bus 
332 which provides the Application Processor 302 with 
connection to standard I/O devices through local I/O 314 
and optionally to PCI expansion slots 316 into which 
additional PCI devices can be connected. 

In a smaller configuration of the preferred embodiment of 
the invention the number of Policy Engines 322 does not 
exceed the maximum load allowed on a PCI bus 324; in that 
case the PCI-to-PCI bridges 306, 308, and 310 are elimi- 
nated and up to four Policy Engines 322 are connected 
directly to the host PCI bus 324, each connecting also to two 
Ethernet ports 320. This smaller configuration may still have 
the PCI-to-PCI Bridge 312 present to isolate Local I/O 314 
and expansion slots 316 from the PCI bus 324, or the Bridge 
312 may also be eliminated and the local I/O 314 and 
expansion slots 316 may also be connected directly to the 
host PCI bus 324. 

I. Packet Flow 

In certain embodiments, the PE utilizes two Ethernet
MACs (Media Access Controllers) with IEEE 802.3 standard
Media Independent Interface ("MII") connections to
external physical media (PHY) devices, which attach to
Ethernet segments. Each Ethernet MAC receives packets
into buffers addressed by buffer pointers obtained from a
producer-consumer ring and then passes the buffer (that is,
passes the buffer pointer) to a Classification Engine for
processing, and from there to the Policy Processor. The
"buffer pointer" is a data structure comprising the address of
a buffer and a software-assigned "tag" field containing other
information about that buffer. The "buffer pointer" is a
fundamental unit of communication among the various
hardware and software modules comprising a PE. From the
PP, there are many paths the packet can take, depending on
what the applications running on the PP decide is the
proper disposition of that packet. It can be transmitted, sent
to Crypto, delayed in memory, passed through a Classification
Engine again for further processing, or copied from the
PE's memory over the PCI bus to the host's memory or to
a peer device's memory, using the DMA engine. The PP may
also gather statistics on that packet into records in a hash
table or in general memory. A pointer to the buffer containing
both the packet and data structures describing that packet
is passed around among the various modules.

The PP may choose to drop a packet, to modify the
contents of the packet, or to forward the packet to the AP or
to a different network segment over the PCI Bus (e.g. for
routing). The AP or PP can create packets of its own for
transmission. A 3rd-party NIC (Network Interface Card) on
the PCI bus can use the PE memory for receiving packets,
and the PP and AP can then cooperate to feed those packets
into the classification stream, effectively providing acceleration
for packets from arbitrary networks. When doing so,
adjacent 2 KB buffers can be concatenated to provide buffers
of any size needed for a particular protocol.

FIG. 2 illustrates packet flow according to certain
embodiments of the present invention. Each box represents
a process which is applied to a packet buffer and/or the
contents of a packet buffer 620 as shown in FIG. 7. The
buffer management process involves buffer allocation 102
and the recovery of retired buffers 118. When buffer allocation
102 into an RX Ring 402 or 404 occurs, the Policy
Processor 244 enqueues a buffer pointer into the RX Ring
402 or 404 and thus allocates the buffer 620 to the receive
MAC 216 or 230, respectively. Upon receiving a packet, the
RX MAC controller 220 or 228 uses the buffer pointer at the
entry in the RX ring structure of FIG. 6 which is pointed to
by MFILL 516 to identify a 2 KB section of memory 260
that it can use to store the newly received packet. This
process of receiving a packet and placing it into a buffer 620
is represented by physical receive 104 in FIG. 2.

The RX MAC controller 220 or 228 increments the
MFILL pointer 516 modulo ring size to signal that the buffer
620 whose pointer is in the RX Ring 402 or 404 has been
filled with a new packet 610 and 612 plus receive status 600
and 602. The Ring Translation Unit 264 detects a difference
between MFILL 516 and MCCONS 514 and signals to the
classification engine 238 or 242, respectively, for RX Ring
402 or 404, that a newly received packet is ready for
processing. The Classification Engine 238 or 242 applies
Classification 106 to that packet and creates a description of
the packet which is placed in the packet buffer software area
614, then increments MCCONS 514 to indicate that it has
completed classification 106 of that packet. The Ring Translation
Unit 264 detects a difference between MCCONS 514
and MPCONS 512 and signals to the Policy Processor 244
that a classified packet is ready for action processing 108.
The Policy Processor 244 obtains the buffer pointer from
the ring location pointed to by MPCONS 512 by dequeueing that
pointer from the RX Ring 402 or 404, and executes
application-specific action code 108 to determine the
disposition of the packet. The action code 108 may choose to
send the packet to an Ethernet Transmit MAC 218 or 234 by
enqueueing the buffer pointer on a TX Ring 406 or 408,
respectively; the packet may or may not have been modified
by the action code 108 prior to this. Alternatively the action
code 108 may choose to send the packet to the attached
cryptographic processor (Crypto) 246 for encryption,
decryption, compression, decompression, security key
management, parsing of IPSEC headers, or other associated
functions; this entire bundle of functions is described as Crypto
112. Alternatively the action code 108 may choose to copy
the packet to a PCI peer 322 or 314 or 316, or to the host
memory, both paths being accomplished by the process
of creating a DMA descriptor 114 as shown in Table 3 and
then enqueuing the pointer to that descriptor into DMA Ring
418 by writing that pointer to DMA_PROD 1116, which
triggers the DMA Unit 210 to initiate a transfer. Alternatively
the action code 108 can choose to temporarily
enqueue the packet for delay 110 in memory 260 that is
managed by the action code 108. Finally, the action code 108
can choose to send a packet for further classification 106 on
any of the Classification Engines 208, 212, 238, or 242,
either because the packet has been modified or because there
is additional classification which can be run on the packet
which the action code 108 can command the Classification
process 106 to execute via flags in the RX Status Word 600,
through the buffer's software area 614, or by use of tag bits
in the 32-bit buffer pointer reserved for that use.
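
The receive path described above amounts to a single ring shared by several agents, each owning one index that chases the one ahead of it. The following is a minimal software model of that idea, not the hardware design: MFILL, MCCONS and MPCONS follow the text, whereas the `alloc` index used for buffer allocation, the class and method names, and the "one slot kept empty" full-ring convention are assumptions made for the sketch.

```python
class RxRing:
    """Toy model of an RX buffer-pointer ring with chained indices."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.slots = [None] * size
        self.alloc = 0   # hypothetical: where the PP enqueues empty buffers
        self.mfill = 0   # next slot the RX MAC fills with a packet
        self.mccons = 0  # next slot the Classification Engine classifies
        self.mpcons = 0  # next slot the Policy Processor dequeues

    def _next(self, index: int) -> int:
        return (index + 1) % self.size  # indices advance modulo ring size

    def allocate(self, buffer_pointer: int) -> bool:
        """Buffer allocation: give an empty buffer to the MAC."""
        if self._next(self.alloc) == self.mpcons:
            return False  # ring full (one slot is kept empty)
        self.slots[self.alloc] = buffer_pointer
        self.alloc = self._next(self.alloc)
        return True

    def mac_receive(self):
        """Physical receive: the MAC fills the buffer at MFILL."""
        if self.mfill == self.alloc:
            return None  # no empty buffer available
        pointer = self.slots[self.mfill]
        self.mfill = self._next(self.mfill)
        return pointer

    def classify_ready(self) -> bool:
        return self.mfill != self.mccons  # MFILL ahead of MCCONS

    def classify(self):
        """Classification: consume the slot at MCCONS."""
        if not self.classify_ready():
            return None
        pointer = self.slots[self.mccons]
        self.mccons = self._next(self.mccons)
        return pointer

    def action_ready(self) -> bool:
        return self.mccons != self.mpcons  # MCCONS ahead of MPCONS

    def dequeue(self):
        """Action: the Policy Processor removes the pointer at MPCONS."""
        if not self.action_ready():
            return None
        pointer = self.slots[self.mpcons]
        self.slots[self.mpcons] = None
        self.mpcons = self._next(self.mpcons)
        return pointer
```

In this model the "difference between two indices" test performed by the Ring Translation Unit corresponds to `classify_ready` and `action_ready`; no stage needs a lock because each index has exactly one writer.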

Packets can arrive at the classification process 106 from 
additional sources besides physical receive 104. Classifica- 
tion 106 may receive a packet from the output of the Crypto 
processing 112, from the Application Processor 302 or from 
a PCI peer 322 or 314 or 316, or from the action code 108. 

Packets can arrive at the action code 108 from classifi- 
cation 106, from the Application Processor 302, from a PCI 
peer 322 or 314 or 316, from the output of the Crypto 
processing 112, and from a delay queue 110. Additionally 
the action code 108 can create a packet. The disposition 
options for these packets are the same as those described for 
the receive path, above. 

The Crypto processing 112 can receive a packet from the 
Policy Processor 244 as described above. The Application 
Processor 302 or a PCI peer 322 or 314 or 316 can also
enqueue the pointer to a buffer onto the Crypto Ring 420 to 
schedule that packet for Crypto processing 112. 

The TX MAC 218 or 234 transmits packets whose buffer
pointers have been enqueued on the TX Ring 406 or 408,
respectively. Those pointers may have been enqueued by the
action code 108 running on the Policy Processor 244, by the
Crypto processing 112, by the Application Processor 302, or
by a PCI peer 322 or 314 or 316. When the TX MAC
controller 222 or 232 has retired a buffer either by successfully
transmitting the packet it contains, or abandoning the
transmit due to transmit termination conditions, it will
optionally write back TX status 806 and TX Timestamp 808
if programmed to do so, then will increment MTCONS 714
to indicate that this buffer 840 has been retired. The Ring
Translation Unit 264 detects that there is a difference
between MTCONS 714 and MTRECOV 712 and signals to
the Policy Processor 244 that the TX Ring 406 or 408 has at
least one retired buffer to recover; this triggers the buffer
recover process 118, which will dequeue the buffer pointer
from the TX ring 406 or 408 and either send the buffer
pointer to Buffer Allocation 102 or add the recovered
buffer to a software-managed free list for later use by Buffer
Allocation 102.

It is also possible for a device in the PCI expansion slot
316 to play the role defined for the attached Crypto processor
246 performing crypto processing 112 via DMA 114 in
this flow.
1. Communication and Buffer Management

In certain embodiments, the buffer memory consists of 16
to 128 MB of parity-protected SDRAM. It is used for packet
buffers, for code and data structures for the microprocessor,
as a staging area for Classification Engine microcode
loading, and for buffers used in communicating with the AP
and other PCI agents. The following uses of memory are
defined by the architecture of the Policy Engine:

Buffer Pointer rings for RX_MAC_A, RX_MAC_B,
TX_MAC_A, TX_MAC_B (where "RX" denotes
"receive", "TX" denotes "transmit", and "_A" and
"_B" indicate which instance of the MAC is being
described.)

A pool of 2 KB-aligned buffers used for holding packets
that are being processed in this chip as well as information
about those packets; larger buffers can be created
by concatenating these 2 KB buffers if needed for
processing larger packets from other media.

"Reclassification" pointer rings for each of the four
Classification Engines; these are used to schedule packets
for processing on that CE, when the classification of the
packet is being scheduled by an agent other than an RX
MAC.

A ring containing pointers to DMA descriptors used to
schedule transfers using the DMA engine; data copies
between PCI and memory in either direction are scheduled
by enqueueing descriptor pointers on this ring.

A pool of memory allocated for use as DMA descriptors.

A pointer ring for scheduling packets for processing on
the Crypto unit.

An area that contains instructions for the microprocessor,
including the boot sequence.

An area for staging microcode to be loaded into the
control store of the four Classification Engines.

Page tables for the Policy Processor MMU.

16 words dedicated to mailbox communications; writes to
these words from the PCI bus also set the corresponding
mailbox bit in the mailbox status register which
signals to the processor that the indicated mailbox has
a new message.

A pool of 2 KB buffers that belong to the AP and are used
for scheduling transmits of packets that have been
handed to the AP for processing or that originate at the
AP.

In addition to these uses, parts of the memory may be
allocated to the applications running on the PP for storing
data such as local variables, counters, hash tables and the
data structures they contain, AP to PP and PP to AP
application-level communications areas, external coprocessor
communication and transmit buffers, etc.

The Policy Engine takes advantage of the fact that buffers
are 2 KB-aligned, and has the hardware ignore the lower 11
bits of each buffer base pointer, thus enabling software to use
those pointer bits as tags.
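
The tag scheme can be illustrated with a short sketch. This is a software model only, assuming 32-bit buffer pointers; the names TAG_MASK, make_pointer, buffer_base, and tag_of are hypothetical and do not come from the Policy Engine itself.

```python
BUFFER_SIZE = 2048            # buffers are 2 KB-aligned
TAG_MASK = BUFFER_SIZE - 1    # the lower 11 bits (0x7FF)


def make_pointer(base, tag):
    """Pack a software tag into the low 11 bits of an aligned base address."""
    if base & TAG_MASK:
        raise ValueError("buffer base must be 2 KB-aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag must fit in 11 bits")
    return base | tag


def buffer_base(pointer):
    """Model of what the hardware uses: the pointer with its low 11 bits ignored."""
    return pointer & ~TAG_MASK & 0xFFFFFFFF


def tag_of(pointer):
    """Recover the software tag carried in the low 11 bits."""
    return pointer & TAG_MASK
```

Because the hardware masks the low bits, a tagged pointer can be enqueued on a ring unchanged, and software can read the tag back when the pointer is later dequeued.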

A simple and lightweight mechanism for buffer allocation
and recovery is provided. Hardware support for atomic
enqueue and dequeue of buffers through producer-consumer
rings, along with detection of completed (retired) buffers,
enables buffer management in only a few instructions. In the
real-time executive loop run on the PP, a short section is
devoted to reclamation of free buffers into the free list from
those rings which indicate to the PP that they have retired
buffers available for recovery. The RX pools of allocated,
empty buffers maintained in the RX Rings can be replenished
from the freelist each time a filled, classified RX buffer
is dequeued from that ring, thus maintaining the pool size.
A simple linked list of buffers or other method well known
to those versed in the art can be used to implement a
software-managed freelist from which to feed the pools.

In order to support atomic enqueueing/dequeueing of
buffer pointers and of DMA descriptor pointers, a standard
memory-based producer/consumer ring structure is supported
in hardware for many purposes (as represented by the
circle-with-arrow symbols in FIG. 3). In most cases one or
more of the consumers is also a producer for the next
consumer, so the rings have a series of index pointers which
chase each other in sequence; for example, the MAC RX
rings have a Produce Pointer for the allocation of empty
buffers, a MAC FILL Pointer for the RX MAC to consume
empty buffers and produce full buffers, a Classification
Engine Consume Pointer for the CE to consume freshly
received buffers and to produce classified buffers, and a
Policy Processor Consume Pointer for the PP to consume
classified packets, as shown in FIG. 6. The leading producer
accesses the ring through an "enqueue" register, and the end
consumer accesses the ring through a "dequeue" register,
obviating the need for mutexes (mutual exclusion locks) or
(slow) memory accesses in managing shared ring structures.
Interim consumer-producers fetch a buffer pointer
through a ring index, then increment that index later to
signal that they have finished processing the referenced
buffer and that it is available for the next consumer.

The serialized multiple-producer/multiple-consumer ring
structure allows for one ring to support a complicated series of
steps with much less hardware than would be required to
support a separate FIFO between each producer and
consumer, and eliminates the need for each consumer-producer
to write pointers to the next ring; every cycle saved
in a real-time system such as this can be significant.

Hardware detects when there is a difference between a
producer's ring index and the ring index for the next
consumer in that communication sequence, and signals to
the consumer that there is at least one buffer pointer in its
ring for processing; thus the presence of work to do wakes
up the associated unit, implementing a dataflow architecture
through the use of hardware-managed rings.

Ring overflow, underflow, and threshold conditions are
detected and reported to the ring users and the PP as
appropriate.

2. Memory and Ring Translated Memory
2.1 Memory
Main memory in the preferred embodiment consists of up
to 128 MB of synchronous DRAM (SDRAM) in two
DIMMs (Dual In-line Memory Modules) or one double-sided
DIMM. Detecting the presence of the DIMMs and
their attributes uses the standard Serial Presence Detect
interface, using the SPD register to manage accesses to the
serial PROM. (The same interface is used to access a serial
PROM containing MAC addresses, ASIC configuration
parameters, and manufacturing information.) Depending on
the size of DIMMs installed, memory might not be contiguous;
each socket is allocated 64 MB of address space,
and will alias within that 64 MB space if a smaller DIMM
is used. Alternatively one 128 MB DIMM is supported, on
one socket only.

2.2 Ring Translated Memory
The pointer rings associated with various units are simply
a region of memory which is accessed through a translation
unit. The translation unit implements the rings as a base
register (which is used to assign an arbitrary memory
location to be used for the rings) plus a set of index registers
which each point to an array entry relative to the base
address. Reads and writes to the address associated with a
particular index register actually access memory at the ring
entry pointed to by that index register; that is, such accesses
are indirect. Some index registers are automatically incremented
after an access (for atomic enqueue and dequeue
operations) issued by leading producers or end consumers,
while others are incremented specifically by their owner
(generally an interim consumer-producer) to indicate that
the referenced buffer has been processed and is now available
for the next consumer down the chain. Pairs of pointers
have a producer-consumer relationship, and a difference
between them indicates to the consumer that there is work to
do; that difference is detected in hardware and is signaled to
the appropriate unit.

There are 15 rings in the preferred embodiment, each 4
KB in size (1 K entries of 4 bytes each); the 60 KB array of
15 rings resides on a 64 KB boundary in memory. The base
of this array is pointed to by the Rings Base Register. The
rings themselves are not accessed directly; instead they
appear to the users as a set of "registers" which are read or
written to access the entries in memory that are pointed to by
the associated index register. For addressing purposes each
ring is assigned a number, which is used as an index both
into the array in memory and into the Ring Translation Unit
(RTU) register map.

Writes to a ring will cause the data (which is generally a
buffer pointer, or in the case of the DMA Ring, a pointer to
a DMA descriptor) to be stored at the location in memory
pointed to by [(RingArray[Ring #])+(RTU index register
used)], and then that index register is incremented modulo
ring size. Reads from a ring will return the data (buffer
pointer or descriptor pointer) pointed to by [(RingArray
[Ring #])+(RTU index register used)]; if that register is an
auto-increment register then it will increment modulo ring
size after the read operation. A read attempted via a consumer
index register which matches its corresponding produce
pointer (that is, there was no work to do) will return
zero and the index pointer will not increment. Registers
which are not auto-increment are incremented explicitly by
that register's owner when the referenced buffer has been
processed; the increment is done via a hardware signal, not
by register access.

Ring underflow/overflow and near-empty/near-full
threshold status (as appropriate) are reported through the
CRISIS register to the PP and the AP.

II. Policy Engine

FIG. 3 shows a Policy Engine ASIC block diagram
according to certain embodiments of the present invention.
The ASIC 290 contains an interface 206 to an external
RISC microprocessor which is known as the Policy Processor
244. Internal to the Processor Interface 206 are registers
for all units in the ASIC 290 to signal status to the Policy
Processor 244.

There is an interface 204 to a host PCI Bus 280 which is
used for movement of data into and out of the memory 260,
and is also used for external access to control registers
throughout the ASIC 290. The DMA unit 210 is the Policy
Engine 322's agent for master activity on the PCI bus 280.
Transactions by DMA 210 are scheduled through the DMA
Ring 418. The Memory Controller 240 receives memory
access requests from all agents in the ASIC and translates
them to transactions sent to the Synchronous DRAM
Memory 260. Addresses issued to the Memory Controller
240 will be translated by the Ring Translation Unit 264 if
address bit 27 is a '1', or will be used untranslated by the
memory controller 240 to access memory 260 if address bit
27 is a '0'. Untranslated addresses are also examined by the
Mailbox Unit 262 and if the address matches the memory
address of one of the mailboxes the associated mailbox
status bit is set if the transaction is a write, or cleared if the
transaction is a read. In addition to the dedicated rings in the
Ring Translation Unit 264 which are described here, the
Ring Translation Unit also implements 5 general-purpose
communications rings COM[4:0] 226 which software can
allocate as desired. The memory controller 240 also implements
an interface to serial PROMs 270 for obtaining
information about memory configuration, MAC addresses,
board manufacturing information, Crypto Daughtercard
identification and other information.

The ASIC contains two Fast Ethernet MACs, MAC_A
and MAC_B. Each contains a receive MAC 216 or 230,
respectively, with associated control logic and an interface to
the memory unit 220 or 228, respectively; and a transmit
MAC 218 or 234, respectively, with associated control logic
and an interface to the memory unit 222 or 232, respectively.
Also associated with each MAC is an RMON counter unit
224 or 236, respectively, which counts certain aspects of all
packets received and transmitted in support of providing the
Ethernet MIB as defined in Internet Engineering Task Force
(IETF) standard RFC 1213 and related RFCs.

RX_A Ring 402 is used by RX MAC_A controller 220
to obtain empty buffers and to pass filled buffers to Classification
Engine 238. Similarly, RX_B Ring 404 is used by
RX MAC_B controller 228 to obtain empty buffers and to
pass filled buffers to Classification Engine 242. TX_A Ring
406 is used to schedule packets for transmission on TX
MAC_A 218, and TX_B Ring 408 is used to schedule
packets for transmission on TX MAC_B 234.
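
The ring access rules of the Ring Translation Unit (a write stores the pointer and advances the index; a read through a consumer index that matches its producer returns zero without advancing) can be modeled in software as a minimal sketch. The class and function names here are hypothetical, the model covers only a lead producer and an end consumer, and it deliberately omits overflow and threshold detection.

```python
RING_ENTRIES = 1024          # 1 K entries of 4 bytes each
RING_SELECT_BIT = 1 << 27    # address bit 27 selects ring translation


def is_ring_access(address):
    """True if the Memory Controller would hand this address to the RTU."""
    return bool(address & RING_SELECT_BIT)


class Ring:
    """One ring with a lead producer index and an end consumer index."""

    def __init__(self):
        self.entries = [0] * RING_ENTRIES
        self.produce = 0   # auto-increments after each enqueue write
        self.consume = 0   # auto-increments after each successful dequeue read

    def work_pending(self):
        """The index difference that hardware signals to the consumer."""
        return self.produce != self.consume

    def enqueue(self, pointer):
        """Write through the enqueue register: store, then advance modulo ring size."""
        self.entries[self.produce] = pointer
        self.produce = (self.produce + 1) % RING_ENTRIES

    def dequeue(self):
        """Read through the dequeue register; an empty ring returns zero."""
        if not self.work_pending():
            return 0       # no work to do: the index does not move
        pointer = self.entries[self.consume]
        self.consume = (self.consume + 1) % RING_ENTRIES
        return pointer
```

Since the store and the index update happen as one step inside the translation unit, several masters can enqueue on the same ring without a lock; in this single-threaded model that property is simply assumed.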

There are four Classification Engines 208, 212, 238, and
242 which are microprogrammed processors optimized for
the predicate analysis associated with packet filtering. The
classification engines are described in FIG. 13. Packets are
scheduled for processing by these engines through the use of
the Reclassify Rings 412, 416, 410, and 414 respectively;
in addition, the RX MAC controllers MAC_A 220 and MAC_B
228 can schedule packets for processing by Classification
Engines 238 and 242, respectively, through use of the RX
Rings 402 and 404, respectively.

There is a Crypto Processor Interface 202 which enables
attachment of an encryption processor 246. The Policy
Processor 244 can issue reads and writes to the Crypto
Processor 246 through this interface, and the Crypto Processor
246 can access SDRAM 260 and control and status
registers internal to the interface 202 through use of interface
202.

A Timestamp counter 214 is driven by a stable oscillator 
292 and is used by the RX MAC logic 220 and 228, the TX 
MAC logic 222 and 232, the Classification Engines 208, 
212, 238, and 242, the Crypto Processor 246, and the Policy 
Processor 244 to obtain timestamps during processing of
packets. 

Preferably, the Policy Engine Units have the following 
characteristics: 
1. PCI Interface 

33 MHz operation.

32/64-bit data path.

32-bit addressing both as a target and as an initiator.

Initiator and Target interface.

One interrupt output.

Up to 32-byte bursts as a master; up to 32-byte bursts to
memory (BAR0) as a target (disconnects on 32-byte
boundaries), single data-phase operations as a target for
Register (BAR1) and Ring Translation Unit (BAR2)
spaces.

Single configuration space for the entire device. 

2. RISC Processor Interface 

Interface to external SA-110 StrongARM processor, run- 
ning the bus at ASIC core clock or half core clock as 
programmed in the Processor Control and Status Reg- 
ister. 

Handles all transaction types for PIO's (reads and writes 
of I/O registers), cache fills/spills, and non-cached 
memory accesses. 

Low- and high-priority interrupt signals, driven by 
enabled bits of PISR and PCSR. 

Boots from main memory; an external agent must initialize
memory, download local initialization code, etc.,
and release processor reset to enable operation.

Support for remap of the trap/reset vector to any location 
in PE Memory. 

3. Classification Engine 

Microcoded engine for accelerating comparisons and 
hash lookups. 

Runs a set of comparisons on fields extracted from 32-bit 
words within a packet to offload processor. 

Operations can be on fields in the packet, or on pairs of 
result bits from previous comparisons. 

Produces a result vector of one bit result for each comparison
and for each boolean operation on pairs of bits in
the vector (selected bits of which are then stored in a
data structure in the 2 KB packet buffer).

Can also execute one or more hash lookups on one or 
more tables based on keys extracted from the packet. 
Optimized for linked list chasing through the use of 
non-blocking loads and speculative fetch of the next 
record; searches of hash tables implementing conflict 
resolution by chaining are thus accelerated. The hash 
lookup results are also stored in the packet buffer in 
memory. 

Arbitrary fields can be extracted from the packet and 
returned in the packet's data structure to the PP. Arbi- 
trary computation on extracted fields and result vector 
bits which yield multi-bit results can also be done in the 
CE, and the results returned to the PP in the data 
structure. 

The above computations could also incorporate operands 
found in hash table records found during the above 
hash searches. 

The contents of hash table records found using keys 
extracted from the packet can be updated with results of 
computations such as those described above. 

Supports fast TCP/IP checksum calculation via use of the 
"split-add" unit. 

Decisions and branches are supported.

Comparisons, extractions and computations, and hashing
are run speculatively before the packet is handed to the
Policy Processor; if the code on the PP (the Action
section of the application) needs to run rules against the
packet, the comparisons are done and ready for it to
use, with single-bit decisions ("predicate analysis
results") for each policy to apply. Similarly, if the
Action code needs to compute or extract information
about the packet, the results of that computation are
already available in the packet's data structure.

Packets are scheduled for classification from both the RX
MAC ring and a reclassification ring for the "Inbound"
CEs, from a reclassification ring alone for "Outbound"
CEs.

4. Ethernet MACs

Standard 10/100 Mbit IEEE 802.3u-compliant MAC with
MII interface to external PHY.

Each RX MAC has support for a single unicast address
match, multicast hash filter, broadcast packets, and
promiscuous mode.

Serial MII management interface to PHY.

RX MAC inserts packets along with receive status into 2
KB-aligned buffers, with the packet aligned so that the
IP header is on a 32-bit boundary; keeping the receive
buffer ring replenished with empty buffers is the only
processor interaction with the MAC (i.e. there is no
run-time device driver needed for the MAC).

Transmit MAC follows a ring of buffer pointers; scheduling
of transmit buffers from any source is supported
through a register which makes enqueuing atomic, thus
allowing multiple masters to schedule transmits without
mutexes.

Mode bit for PASS or DROP of bad Ethernet packets
(CRC errors etc.).

Hardware counters to support RMON ETHER statistics
gathering.

MACs operate on 2.5 MHz/25 MHz RXCLK and
TXCLK from the external Fast Ethernet PHY; each has
its own clock domain and a synchronizing interface to
the ASIC core.

5. Memory Controller 
Manages up to two DIMMs of SDRAM.

Aggressively schedules two banks independently for
high performance.

Arbitrates among many agents; priorities are:

1) MAC_A, MAC_B ping-pong (top priority); internal to
each MAC, the TX and RX units arbitrate locally for the
MAC's memory interface, with ping-pong priority.

2) Round-robin priority among PP, CE_AI, CE_AO,
CE_BI, CE_BO, DMA, PCI_Target, Crypto.

Supports different speed grades of SDRAM, program- 
mable timing. 

Parity generation and checking. 

Serial Presence Detect (SPD) interface. 

Contains the Ring Translation Unit for mapping Ring 
accesses to Memory addresses. 

Contains the Mailbox address-matching and status unit. 

6. DMA Engine 
Can be used by PP, Crypto, and also by the host
(Application Processor) and PCI peer devices.

Moves word-aligned bursts of data between SDRAM and
PCI bus.

Data is transferred between memory and PCI in byte lane
order, for endian-neutral transfers of byte streams. See
"Endianness" in Section 8.

Each DMA is controlled by a 16-byte descriptor; the
initiator first constructs a descriptor, then enqueues a
pointer to that descriptor on the DMA Ring to schedule
the transfer.

Atomic enqueueing is supported to eliminate locks when
scheduling DMAs.

At completion of each DMA, the unit can optionally set
one of 8 status bits in the PISR (Processor Interrupt
Status Register) or one of 8 status bits in the HISR
(Host Interrupt Status Register), as indicated in the
descriptor.

DMA engine ignores lower 11 bits of the SDRAM
address, using a separate "buffer offset" instead. This is
to support the buffer tag field in the buffer pointer used
by software.

Descriptor is defined in "DMA Command Queue and
Descriptors" in Section 6.

PCI command code is carried in the descriptor for flexibility.

7. Crypto Control 

PE ASIC hosts a 32-bit PCI bus for connecting to the
Crypto coprocessor(s), with two external request/grant
pairs and two interrupt inputs. PP can directly access
devices on this bus.

4 BAR's ("Base Address Registers", which are part of the 
PCI standard) are supported: BAR0 for Memory, 
BAR1 for access to the ring status bits, BAR2 for 
access to the rings, and BAR3 for prefetched access to 
Memory. 

Packets are scheduled for encryption by placing a Crypto
descriptor in a data structure in the packet buffer in
memory, then enqueueing the pointer to that buffer in
the Crypto Ring. (Communication Ring 4 is also available
for similar use with a second processor.)

The Crypto chip will detect queue-not-empty by polling
the CSTAT (Crypto Status Register) and will
dequeue the buffer pointer at the head of the queue for
processing. Two rings are available so that up to two
devices can be supported for this function.

After processing a packet, the Crypto chip will write the
result back to memory and then enqueue the buffer
pointer on the specified destination ring (for further
classification, for examination on the PP, for DMA to a
target on the PCI bus, or for transmit).

8. Mailbox Unit

Monitors 16 word-sized mailboxes in memory space.

On address match, sets (clears) the status bit in the
Mailbox Status Register associated with the word
written (read). Selected status bits contribute to a Mailbox
Attention status bit in the PISR.
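
The set-on-write, clear-on-read behavior of the Mailbox Unit can be sketched as follows. This is an illustrative model only: it assumes the 16 mailboxes are contiguous 32-bit words starting at a configurable base address, and it assumes that "selected status bits" are chosen by an enable mask. Neither detail is stated above, and the names are hypothetical.

```python
MAILBOX_COUNT = 16
WORD_BYTES = 4


class MailboxUnit:
    """Snoops untranslated memory transactions and tracks mailbox status."""

    def __init__(self, base_address):
        self.base = base_address   # assumption: mailboxes are contiguous words
        self.status = 0            # models the Mailbox Status Register

    def _index(self, address):
        """Return the mailbox number for an address, or None if no match."""
        offset = address - self.base
        if 0 <= offset < MAILBOX_COUNT * WORD_BYTES:
            return offset // WORD_BYTES
        return None

    def observe(self, address, is_write):
        """A write sets the matching status bit; a read clears it."""
        index = self._index(address)
        if index is None:
            return
        if is_write:
            self.status |= 1 << index
        else:
            self.status &= ~(1 << index)

    def attention(self, enable_mask):
        """Mailbox Attention: true if any selected status bit is set."""
        return bool(self.status & enable_mask)
```

With this arrangement a sender only has to write a message word, and the receiver both fetches the message and acknowledges it with a single read.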

9. Ring Translation Unit 

Base pointer to a 64 KB region of memory (only the first 
60 KB are used, 4 KB remainder is available for other 
use). 

Maintains 15 rings as memory arrays of 1 K 32-bit entries 
each. 

Reads and writes to rings through the RTU are mapped to 
locations in these arrays. 

Some index registers auto-increment, others are incre- 
mented by their owner. 

Delta between producer-consumer index pairs is detected 
in hardware. Any delta is signaled to the consumer 
indicating that there is work to do. 

10 of the rings have specific assignment as shown in FIG. 
3. 

5 general-purpose rings COM[4:0] are provided for software
to allocate as desired; expected use includes a
freelist for DMA descriptors and a freelist of buffers for
the AP or peers to use, messages in to the PP, and
others. COM4 can optionally be used as a second
Crypto ring.

Overflow/underflow and threshold conditions are detected 
and reported through the CRISIS register in the Policy 
Processor interface. 

10. Global TIMER 

32-bit up-counter driven from an external, asynchronous 
clock source. 

Counts at 1 µs in bit 3 (leaving room for finer granularity
in future higher speed implementations). Counter rolls
over approximately every 536.87 seconds.

Status bit in PISR/HISR sets on every transition (high-low
and low-high) in bit[30] to simplify software
extension of the timer value.

An Ethernet crystal (buffered copy) is used as the clock 
source since it is the most stable timebase available. 
Runs at 25 MHz. 

In multi-PE implementations, all PEs receive the same
clock source to avoid relative drift in timestamps. In
systems using multiple PCI cards each containing a PE,
they each receive a local, non-aligned clock.

Used by MACs, Classification Engines, and PP for mark- 
ing events; used for monitoring performance and 
packet arrival order as needed. 
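
The rollover figure follows directly from the bit weighting: if bit 3 counts microseconds, the least significant bit is worth 1/8 µs. The sketch below checks that arithmetic and shows one generic way software might extend the 32-bit value. The extend function is an assumption for illustration (it relies on samples being taken less than one rollover apart) and is not the bit[30] status mechanism described above.

```python
COUNTER_BITS = 32
MICROSECOND_BIT = 3                                      # bit 3 has a weight of 1 us

TICK_SECONDS = 1e-6 / (1 << MICROSECOND_BIT)             # 125 ns per count
ROLLOVER_SECONDS = (1 << COUNTER_BITS) * TICK_SECONDS    # about 536.87 s
BIT30_TOGGLE_SECONDS = (1 << 30) * TICK_SECONDS          # about 134.22 s


def extend(previous_extended, raw32):
    """Extend a 32-bit sample to a wider monotonic value.

    Hypothetical helper: assumes successive samples are taken less than
    one rollover period apart, so a smaller raw value implies one wrap.
    """
    high = previous_extended >> COUNTER_BITS
    if raw32 < (previous_extended & 0xFFFFFFFF):
        high += 1
    return (high << COUNTER_BITS) | raw32
```
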
11. Serial PROM 

Support for a 24C02 256-byte serial PROM at serial 
address 0x7; the memory DIMMs are at addresses 0x0 
and 0x1 for slots 0 and 1 (if supported). 

PROM at 0x7 contains two MAC addresses, full/half- 
speed control indication for the processor bus, manu- 
facturing information, and other configuration and 
tracking information. 

Additional devices on the SPD bus include a Crypto 
Daughtercard IDPROM at address 0x6, and a thermal 
sensor at address 0x4. 

III. Data Structures 
1. Ring Array in Memory 

The 15 rings are packed into a 60 KB array aligned on a 
64 KB boundary in memory. The RING_BASE register 
points to the start of this array. Each ring is 4 KB in size and 
can hold up to 1 K entries of 32 bits each. 

FIG. 5 illustrates a ring array in memory. 

The Ring Translation Unit (RTU) 264 manages 15 arrays 
in memory 260 for communication purposes. Each ring 
actually consists of 1024 32-bit entries in memory for a total 
of 4 KB per ring, along with index registers and logic for 
detecting differences between the index register for a pro- 
ducer and the index register for the associated consumer, 
which is reported to that consumer as an indication that there 
is work for it to do. Various near-full-threshold, near-empty-threshold,
full, and empty conditions are detected as appropriate
to each ring and are reported to the ring users and to
the Policy Processor 244 as appropriate. The RTU 264 
translates Ring accesses into both a memory 260 access at a 
translated address, and in some cases into commands to 
increment specific index pointers after completing that 
memory access. Each ring is assigned a number for mapping 
purposes, and that number is used to index into the array of 
memory 260 in which the rings are implemented. The index 
registers are incremented modulo 4 KB so that FIFO behav- 
ior is achieved. Each index register contains one more 
significant bit than is used for addressing, so that a full ring 
can be differentiated from an empty ring. 
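
The address translation and the full/empty distinction can be made concrete with a small sketch. It models index registers as entry counts with one extra wrap bit (11 bits for 1024 entries) rather than as byte offsets; that representation, and all of the names, are assumptions made for illustration.

```python
RING_COUNT = 15
RING_BYTES = 4096
ENTRY_BYTES = 4
RING_ENTRIES = RING_BYTES // ENTRY_BYTES        # 1024 entries per ring
INDEX_MASK = (RING_ENTRIES << 1) - 1            # 10 address bits plus 1 wrap bit


def entry_address(ring_base, ring_number, index):
    """Memory address of the entry an index register refers to."""
    if ring_base % (64 * 1024):
        raise ValueError("ring array must sit on a 64 KB boundary")
    if not 0 <= ring_number < RING_COUNT:
        raise ValueError("ring number out of range")
    slot = index & (RING_ENTRIES - 1)           # drop the wrap bit for addressing
    return ring_base + ring_number * RING_BYTES + slot * ENTRY_BYTES


def advance(index):
    """Increment an index register, keeping the extra wrap bit."""
    return (index + 1) & INDEX_MASK


def is_empty(producer, consumer):
    """Same slot and same wrap bit: nothing to consume."""
    return producer == consumer


def is_full(producer, consumer):
    """Same slot but opposite wrap bit: the producer is a full lap ahead."""
    return (producer ^ consumer) == RING_ENTRIES
```

Without the extra bit, a producer that has lapped the consumer would present the same index value as an empty ring; the wrap bit is what separates the two cases.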

A Ring Base Register 400 selects the location in memory 
260 of the base of the 64 KB-aligned array 440 represented 
in FIG. 5. The structure is an array of arrays; there is an array 
of 15 rings indexed by the ring number, and each of those 
rings is a 4 KB array of 1024 32-bit entries indexed by 
various index registers used by different agents. 

RX_A Ring 402 and RX_B Ring 404 implement the
structure described in FIG. 6, and are associated with the
receive streams from RX MAC_A 220 and RX MAC_B
228 respectively. TX_A Ring 406 and TX_B Ring 408
implement the structure of FIG. 8, and are associated with
the transmit MACs 222 and 232 respectively. The Reclassify
Rings 410, 412, 414, and 416 are used to schedule packets
for classification on Classification Engines 238, 208, 242,
and 212 respectively, and implement the structure shown in
FIG. 10.

DMA Ring 418 is used to schedule descriptor pointers for
consumption by DMA Unit 210, and implements the structure
shown in FIG. 12. Crypto Ring 420 is used to schedule
buffers for processing on the Crypto Processor 246 and
implements the structure shown in FIG. 11. The five general
purpose communication rings COM[4:0] are available for
assignment by software and also implement the structure
shown in FIG. 11.

2. RX Buffer Pointer Ring and Produce/Consume Pointers 
A ring of buffer pointers resides in the memory for each
RX MAC. Associated with this ring are produce and consume
index pointers for the various users of these buffers to
access specific rings. The Policy Processor allocates free,
empty buffers to the MAC by writing them to the associated
MPROD address in the Ring Translation Unit (RTU),
which writes the buffer address into the ring and increments
the MPROD pointer modulo ring size. The RX MAC chases
that pointer with the MFILL index which is used to find the 
next available empty buffer. That pointer is chased by 
MCCONS which is used by the Classification Engine to 
identify the next packet to run the classification microcode 
on. The PP uses a status bit in the PISR to see that there is 
at least one classified packet to process, then reads the ring 
through MPCONS in the RTU to identify the next buffer that 
the PP needs to process. 

FIG. 6 shows an RX Ring structure related to certain 
embodiments of the present invention. There are two RX 
Rings 402 and 404. Each is located in the Ring Array in
memory 260. Each has four index registers associated with 
it. FIG. 6 shows the ring as an array in memory with lower 
addresses to the top and higher addresses to the bottom of 
the picture. 

The ring's base address 510 is a combination of the Ring 
Base Register 400 and the ring number which is used to 
index into the Ring Array 440 as shown in FIG. 5. Two
instances of the set of four index registers MPCONS 512, 
MCCONS 514, MFILL 516, and MPROD 518 are used to 
provide an offset from the RX Ring Base 510 of the 
particular ring 402 or 404, each of which is a 4 KB array 
520. 

MPROD 518 is the lead producer index for this ring. The 
Policy Processor 244 or the Application Processor 302 
enqueues buffer pointers into the RX Ring 402 or 404 by 
writing the buffer pointer to the RTU's enqueue address for 
the particular ring 402 or 404, which causes the RTU to write 
the buffer pointer to the location in memory 260 referenced 
by MPROD 518, and then to increment MPROD 518 
modulo the ring size of 4096 bytes. This process allocates an 
empty buffer to the RX MAC MAC_A or MAC_B associated
with ring 402 or 404 respectively.

MPROD 518 and MFILL 516 have a producer-consumer 
relationship. Any time there is a difference between the 
value of MPROD 518 and MFILL 516, the RTU signals to 
the associated RX MAC MAC_A or MAC_B that it has
empty buffers available. The region 506 in the RX Ring 402 
or 404 represents one or more valid, empty buffers that have 
been allocated to the associated RX MAC by enqueueing the 
pointers to those buffers. 

When the RX MAC MAC_A or MAC_B receives a 
packet, it obtains the buffer pointer referenced by its asso- 
ciated MFILL pointer 516 by reading from the RTU's 
MFILL address and then writes the packet and associated 
RX Status 600 and RX Timestamp 602 into the buffer 
pointed to by that buffer pointer. When the RX MAC has
successfully received a packet and has finished transferring
it into the buffer, it increments the index MFILL 516 by a
hardware signal to the RTU which causes the RTU to
increment MFILL 516 modulo the ring size of 4096 bytes.
MFILL 516 and MCCONS 514 have a producer-consumer
relationship; when the RTU 264 detects a difference between
the value of MFILL 516 and MCCONS 514 it signals to that
ring's associated Classification Engine 238 or 242 that it has
a freshly received packet to process. The region 504 in the
ring array contains the buffer pointers to one or more full,
unclassified buffers that the RX MAC has passed to the
associated Classification Engine.

The Classification Engine 238 or 242 receives a signal if
the RTU 264 detects full, unclassified packets in RX Ring
402 or 404, respectively. When the dispatch microcode on
that CE 238 or 242 tests the ring status and sees this signal
from the RTU 264, that CE 238 or 242 obtains the buffer
pointer by reading from the RTU's MCCONS address for
that ring. When the CE 238 or 242 has finished processing
that buffer and has written all results back to memory 260,
it signals to the RTU 264 to increment its associated
MCCONS index 514. Upon receiving this signal the RTU
264 increments MCCONS 514 modulo the ring size of 4096
bytes. By sending the signal, the CE 238 or 242 has
indicated that it is done processing that packet and that the
packet is available for the consumer, which is action code
108 running on the Policy Processor 244. The region 502
contains the buffer pointers for one or more full, classified
packets that the Classification Engine has passed to the
Action Code 108.

MCCONS 514 and MPCONS 512 have a producer-
consumer relationship. When the CE 238 or 242 has pro-
duced a full, classified packet then that packet is available
for consumption by the action code 108. The RTU detects
when there is a difference between the values of MCCONS
514 and MPCONS 512 and signals this to the Policy
Processor 244 through a status register in the Processor
Interface 206. The Policy Processor 244 monitors this
register, and when dispatch code on the Policy Processor 244
determines that it is ready to process a full, classified packet
it dequeues the buffer pointer of that packet from the RX
Ring 402 or 404, as appropriate, by reading the RTU's
dequeue address for that ring. This read causes the RTU to
return to the Policy Processor 244 the buffer pointer refer-
enced by that ring's MPCONS index 512, and then to
increment MPCONS 512 modulo the ring size of 4096
bytes. The act of dequeueing the buffer pointer means that
the pointer no longer has any meaning in the RX ring. The
contents of the ring in locations between MPCONS 512 and
MPROD 518 have no meaning, and are indicated by the
Invalid regions 500 and 508. Since this is a ring structure
which wraps, 500 and 508 are actually the same region; in
the figure shown, due to the current values of the ring index
pointers 512, 514, 516, and 518, the Invalid regions 500 and
508 happen to wrap across the start and end of the array
containing this ring, but it should be obvious to one skilled
in the art that under normal circumstances these ring index
pointers can have different values and any of regions 502,
504, or 506 could also be the region which wraps around the
end and beginning of the array 520.
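
The chasing-index behavior of an RX ring can be sketched in software as follows. This is only an illustrative model, not the RTU hardware: it assumes 32-bit entries (so a 4096-byte ring holds 1024 pointers), counts indices in entries rather than bytes, collapses the CE's read-then-signal sequence into one step, and omits ring-full handling. The class and method names are hypothetical.

```python
RING_BYTES = 4096                 # ring size stated in the text
ENTRY_BYTES = 4                   # assumed: one 32-bit buffer pointer per entry
ENTRIES = RING_BYTES // ENTRY_BYTES


class RxRing:
    """Toy model of an RX ring with four chasing indices."""

    def __init__(self):
        self.slots = [0] * ENTRIES
        # Indices count entries here; the hardware may count bytes instead.
        self.mprod = self.mfill = self.mccons = self.mpcons = 0

    @staticmethod
    def _next(index):
        return (index + 1) % ENTRIES      # wrap at the end of the array

    def allocate(self, buffer_ptr):
        """Policy Processor enqueues an empty buffer (advances MPROD)."""
        self.slots[self.mprod] = buffer_ptr
        self.mprod = self._next(self.mprod)

    def fill(self):
        """RX MAC fills the next empty buffer (advances MFILL)."""
        assert self.mfill != self.mprod, "no empty buffer available"
        ptr = self.slots[self.mfill]
        self.mfill = self._next(self.mfill)
        return ptr

    def classify(self):
        """CE finishes a full, unclassified buffer (advances MCCONS)."""
        assert self.mccons != self.mfill, "nothing to classify"
        ptr = self.slots[self.mccons]
        self.mccons = self._next(self.mccons)
        return ptr

    def dequeue(self):
        """Policy Processor dequeues a classified buffer (advances MPCONS)."""
        assert self.mpcons != self.mccons, "nothing classified yet"
        ptr = self.slots[self.mpcons]
        self.mpcons = self._next(self.mpcons)
        return ptr
```

Each stage only ever compares its own index with the index of the stage ahead of it, which is why a difference between two adjacent indices is enough to signal that work is pending.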

2.1 RX Buffer Structure 

The receive data buffer is a 2 KB structure which contains
an Ethernet packet and information about that packet. A
substantially similar format is used for transmitting the
packet, as indicated in FIG. 8. The packet offset from the
base of the buffer is designed so that upon receive the Ether
header is offset by two bytes into a word, thus aligning the
IP header on a word (32-bit) boundary. Enough space is left
before the packet so that encapsulation/encryption headers
(e.g., up to 40 bytes for a standard IPv6 header plus AH and
ESP) can be inserted for encapsulation of the packet without
copying the packet, by just copying the Ethernet header up
to make space and then inserting the encapsulation headers.
The total pad size is 112 bytes; if more is needed then the
Crypto Coprocessor can realign the packet when writing it
back.

The RX MAC can be programmed to either drop bad 
packets or receive them normally; if the latter, then error 
status is also shown in the buffer RX status field. 

FIG. 7 illustrates the receive buffer format. 

A packet is passed around the system by placing it into a
packet buffer 620 and then passing the 2 KB-aligned buffer
pointer among units via pointer rings implemented by the
RTU 264. The RX Status and Transmit Command Word 600
is always located at the word pointed to by the 2 KB-aligned
buffer pointer. All hardware in the Policy Engine 322 is
designed to assume that a buffer pointer is 2 KB-aligned and
to ignore bits [10:0], which allows software to use bits
[10:0] of the buffer pointer to carry software tag information
associated with that buffer.
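
As a brief sketch of how software might exploit the ignored low bits, the helpers below pack and unpack an 11-bit tag in a 2 KB-aligned buffer pointer. The function names are hypothetical and the meaning of the tag is left entirely to software, as in the text.

```python
BUFFER_ALIGN = 2048                 # 2 KB buffers
TAG_MASK = BUFFER_ALIGN - 1         # bits [10:0]


def tag_pointer(buffer_base, tag):
    """Combine a 2 KB-aligned buffer address with an 11-bit software tag."""
    if buffer_base & TAG_MASK:
        raise ValueError("buffer base must be 2 KB-aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag must fit in bits [10:0]")
    return buffer_base | tag


def split_pointer(pointer):
    """Return (buffer_base, tag); hardware is described as using only the base."""
    return pointer & ~TAG_MASK & 0xFFFFFFFF, pointer & TAG_MASK
```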

Upon receiving a packet the RX MAC 220 or 228 places
that packet at an offset of (130) bytes from the beginning of
a buffer 620, and writes zero to the bytes at byte offset (128)
and (129) from the beginning of that buffer; these two bytes
are called the Ethernet Header Pad 618. The packet consists
of the (14)-byte Ethernet header 610 and the payload 612 of
the Ethernet packet, which are stored contiguously in the
buffer 620. The reason for inserting the Ethernet Header Pad
is to force protocol headers encapsulated in the Ethernet
packet to be word (32-bit) aligned for ease in further
processing; encapsulated protocols such as IP, TCP, UDP, etc.
have word-oriented formats.
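
The alignment arithmetic can be checked directly. The constants below are the offsets given in the text; the variable and function names are illustrative only.

```python
PKT_OFFSET_RX = 130       # byte offset at which the RX MAC stores the packet
ETH_HEADER_LEN = 14       # Ethernet header 610
WORD = 4                  # bytes in a 32-bit word

pad_start = PKT_OFFSET_RX - 2                    # bytes 128 and 129 form the pad
payload_start = PKT_OFFSET_RX + ETH_HEADER_LEN   # first byte after the header


def is_word_aligned(offset):
    """True when a byte offset falls on a 32-bit word boundary."""
    return offset % WORD == 0
```

The Ethernet header itself starts two bytes into a word (130 is not a multiple of 4), but the encapsulated protocol header that follows it begins at offset 144, which is word-aligned.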

The RX MAC control logic 220 or 228 then writes the RX 
Status Word 600 into the buffer 620 at an offset of (0) from 
the start of the buffer, and an RX Timestamp 602 as a 32-bit 
word at byte offset (4) from the start of the buffer 620. The 
RX Status Word has the format shown in Table 1. The 
timestamp is the value obtained from the Timestamp Reg- 
ister 214 at the time the RX status 600 is written to the buffer 
620. The TX Status Word 604 and the TX Timestamp 606 
are not written at this time, but those locations covering the 
two 32-bit words at offsets of 8 and 12 bytes, respectively, 
from the start of the buffer 620 are reserved for later use by 
the TX MAC controllers 222 and 232. 

The format for the RX Status word in Table 1 is such that
it can be used directly as a TX Command Word without
modification; the fields LENGTH and PKT_OFFSET have
the same meaning in both formats. The RX MAC controller
220 or 228 subtracts (4) bytes from the Ethernet packet's
length before storing the LENGTH field in the RX Status
Word 600 such that the (4-byte) Ethernet CRC is not counted
in LENGTH, so that the buffer can be handed to a TX MAC
222 or 232 without the Policy Processor 244 needing to
modify the contents of the buffer.

Pad Space 608 is left before the start of packet 610 and
612 in buffer 620 to support the addition of encapsulating
protocol headers without copying the entire packet. Up to
(112) bytes of encapsulation header(s) can be inserted sim-
ply by copying the Ethernet header 610 (and possibly an
associated SNAP encapsulation header in the start of pay-
load 612) upwards into the Pad Space 608 by the number of
bytes necessary to make room for the insertion headers,
which are then written into the location that was opened up
for them in areas 608, 610, and 612 as needed. If more than
(112) bytes of encapsulation header are being inserted then
the entire payload 612 must be copied to a different location
in the buffer to make room for the inserted headers.
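
The header-insertion technique described above can be sketched as follows. This is a minimal illustration that assumes a plain 14-byte Ethernet header with no SNAP header, and that the first 16 bytes of the buffer hold the status and timestamp words; the function name and return convention are hypothetical.

```python
ETH_HEADER_LEN = 14
MAX_INSERT = 112          # Pad Space 608 size given in the text
RESERVED_BYTES = 16       # status and timestamp words at offsets 0 through 15


def insert_encapsulation(buf, pkt_offset, length, new_headers):
    """Insert new_headers between the Ethernet header and its payload.

    buf is a bytearray holding one packet buffer. Returns the updated
    (pkt_offset, length) values to be stored back into the status word.
    """
    n = len(new_headers)
    if n > MAX_INSERT or pkt_offset - n < RESERVED_BYTES:
        raise ValueError("not enough pad space; payload would need copying")
    eth = bytes(buf[pkt_offset:pkt_offset + ETH_HEADER_LEN])
    new_offset = pkt_offset - n
    buf[new_offset:new_offset + ETH_HEADER_LEN] = eth      # move header up
    start = new_offset + ETH_HEADER_LEN
    buf[start:start + n] = new_headers                     # fill the opened gap
    return new_offset, length + n
```

Only the Ethernet header and the new headers are written; the payload is never copied, which is the point of reserving the pad space.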

The per-packet software data structure 614 is used by the
classification 106, action code 108, encryption processing
112, the host 302 and PCI peers 322, 314, and 316 to carry
information about the packet that is carried in the buffer 620.
The location of the software data structure 614 and the sizes
of the packet header 610 and packet payload 612, as well as
the total size of the packet buffer 620 are not hard limits in
the preferred embodiment. The 2 KB-alignment of the RX
status word 600 and RX Timestamp are enforced by the
hardware; but packets from other sources and also from
other media besides Ethernet can be injected into the clas-
sification flow of FIG. 2 as follows. The SOURCE field of
the RX status word 600 as shown in Table 1 has only a few
reserved codes; the rest can be assigned by software to
identify packets from other sources and also from other
media which do not share the packet format or packet size
of Ethernet. By software convention large buffers can be
assigned by grouping contiguous 2 KB buffers together and
treating them as one buffer; the pointer to this large buffer
620 will still be 2 KB-aligned and the RX Status Word 600
and RX Timestamp 602 will still reside at that location in the
buffer. The packet area 610 and 612 can be arbitrarily large
to accommodate a packet from a different medium. The
location of the software data structure 614 can be moved
downwards as the larger payload space is allocated. Alter-
natively the software can choose to allocate buffers so that
they have space before the 2 KB-aligned RX Status Word
600, and carry the software data structure 614 above the RX
Status Word 600 rather than below the Payload 612 as shown
in FIG. 7. The advantage of this second approach is that the
location of the software data structure is always known to be
at a fixed location relative to the RX Status Word 600, rather
than having that location be a variable depending on differ-
ent media and the resulting variations in the size of the
packet payload 612.

The section marked "Available for software use" contains
transient per-packet information such as the result vector
and hash pointers output by the Classification Engine, a
command-descriptor for the Crypto Unit, buffer reference
counts, an optional pointer to an extension buffer, and any
other data structures that the software defines. "TX Status/
TX Timestamp" is optionally written by the transmit MAC
if it is programmed to do so; that field contains garbage after
an RX.

The "RX Timestamp" field contains the 32-bit value of
the chip's TIMER register at the time that the packet was
successfully received (approximately the time of receipt of
the end of packet) and the RX_STATUS field was written.
The "RX Status" field is one 32-bit word with the following
format.

Note throughout this document that bit [31] is the left
(most significant) bit of a 32-bit word, and bit [0] is right
(least significant). "MCSR" mentioned in Table 1, below, is
the MAC Control and Status Register.

TABLE 1

Ethernet RX Status Word and TX Command Word Format

Bit       Field        Description

[31]      BAD_PKT      Summary error bit; set if any of [30:27, 15:14] is
                       set, which can only happen if the MAC is programmed
                       to receive bad frames.
[30]      CRC_ERR      Ethernet frame had incorrect CRC and
                       (MCSR[RCV_BAD]==1) for this MAC.
[29]      RUNT         Ethernet frame was smaller than legal and
                       (MCSR[RCV_BAD]==1) for this MAC.
[28]      GIANT        Ethernet frame was larger than legal and
                       (MCSR[RCV_BAD]==1) for this MAC.
[27]      PREAMB_ERR   Invalid preamble and (MCSR[RCV_BAD]==1) for this
                       MAC. This error is associated with some previous
                       event, not with the current packet.
[26:16]   LENGTH       For RX, number of bytes in the Ethernet frame
                       including the Ethernet header but not including the
                       Ethernet CRC. For TX, length of packet, including
                       CRC if (MCSR[CRC_EN]==0).
[15]      DRBL_ERR     Odd number of nibbles received (dribble) and
                       (MCSR[RCV_BAD]==1) for this MAC.
[14]      CODE_ERR     4b/5b encoding error and (MCSR[RCV_BAD]==1) for
                       this MAC.
[13]      BCAST        The received packet was a broadcast packet
                       (destination address is all 1's).
[12]      MCAST        The received packet was a multicast packet and was
                       passed by the multicast hash filter.
[11:08]   SOURCE       This indicates the source of the packet or other
                       source as marked later by software. If the packet
                       was generated at a RX MAC then this field is 0x0
                       for MAC_A or 0x1 for MAC_B.
[07:00]   PKT_OFFSET   This is the byte offset from the beginning of the
                       packet buffer to the first byte of the Ethernet
                       header. Other agents may choose to move this offset
                       in order to encapsulate the IP packet or to strip
                       off encapsulation headers. The CE, PP, and AP all
                       use this offset when accessing the frame in this
                       buffer. The RX MAC will always write a value of
                       0x82 into this field, indicating that the Ethernet
                       frame was received into the buffer starting at byte
                       offset 130 from the start of the buffer.

The same packet buffer format is used for encryption and
transmission; for those uses the only meaningful fields are
LENGTH, PKT_OFFSET and the contents of the Ethernet
frame found at that offset; plus for encryption the encryption
descriptor included in the "Software" area in the buffer.
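
As an illustration of the bit layout in Table 1, the following sketch splits a 32-bit RX Status Word into its fields. The bit positions are taken from the table; the function name and the dictionary representation are hypothetical conveniences.

```python
FLAG_BITS = {
    "BAD_PKT": 31, "CRC_ERR": 30, "RUNT": 29, "GIANT": 28,
    "PREAMB_ERR": 27, "DRBL_ERR": 15, "CODE_ERR": 14,
    "BCAST": 13, "MCAST": 12,
}


def decode_rx_status(word):
    """Split a 32-bit RX Status Word into the fields listed in Table 1."""
    fields = {name: bool((word >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["LENGTH"] = (word >> 16) & 0x7FF      # bits [26:16]
    fields["SOURCE"] = (word >> 8) & 0xF         # bits [11:08]
    fields["PKT_OFFSET"] = word & 0xFF           # bits [07:00]
    return fields
```

For example, a 60-byte broadcast frame received on MAC_B would carry LENGTH 60, BCAST set, SOURCE 0x1, and PKT_OFFSET 0x82.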

3. TX Buffer Pointer Rings and Producer/Consumer 
Pointers 

A packet gets scheduled for transmission by enqueueing
the address of the buffer onto the pointer queue for that
transmit MAC, by writing it to MTPROD in the RTU (MAC
A and MAC B each have their own ring and associated
registers). Any time the produce pointer is not equal to the
consume pointer for that ring, the associated MAC will be
notified that there is at least one packet to transmit and will
follow the pointer to obtain the next buffer to deal with.
When the packet has been retired the TX controller will
write back status if configured to do so, then increment the
consume pointer and continue to the next buffer (if any).
The recover pointer is used to track retired buffers (either
successfully transmitted or abandoned due to transmit ter-
mination conditions) for return to the buffer pool, or possibly
for a retransmit attempt; the PP is signaled by the RTU that
there is a delta between MTCONS and MTRECOV, and then
reads the Ring through the RTU register MTRECOV to get
the pointer to the next buffer to recover. MTPROD,
MTCONS, and MTRECOV are duplicated for each instance
of a transmit MAC.

FIG. 8 illustrates the TX Ring Structure according to 
certain embodiments of the present invention. 

The TX Rings 406 and 408 have substantially the same
structure as the RX Rings described previously. The funda-
mental differences are that there is one fewer interim producer-
consumer using this ring, and that this ring is assigned for a
different function with different agents using it. Each ring
406 and 408 is a 4096-byte array 720 in memory 260.

A packet is scheduled for transmit on the TX MACs 222 
or 232 by enqueueing a pointer to the buffer containing the 
packet onto TX Ring 406 or 408, respectively. The buffer 
pointer is enqueued onto 406 or 408 by any agent, by writing 
the buffer pointer to the RTU 264 enqueue address for that 
ring. The RTU 264 writes the buffer pointer to the location 
in memory 260 referenced by the MTPROD index register 
716, and then increments MTPROD 716 modulo the ring 
size of 4096 bytes. There is a producer-consumer relation- 
ship between MTPROD 716 and MTCONS 714; when the 
RTU detects a difference in the values of MTPROD 716 and 
MTCONS 714 it signals to the associated TX MAC con- 
troller 222 or 232 that there is a packet ready to transmit. The 
region 706 in the TX Ring 406 or 408 contains one or more 
buffer pointers for the buffers containing packets scheduled 
for transmission. 

The TX MAC controller 222 or 232 obtains the buffer
pointer for the buffer 620 containing this packet by reading 
the RTU's MTCONS address for TX Ring 406 or 408, 
respectively, which causes the RTU to return to the MAC the 
buffer pointer in memory 260 referenced by MTCONS 714. 
When the TX MAC 218 or 234 has successfully transmitted 
this packet or has abandoned transmitting this packet due to 
transmit termination conditions, its controller 222 or 232 
respectively will optionally write back TX Status 806 and 
TX Timestamp 808 if it has been configured to write status, 
then retires the buffer by signaling to the RTU 264 to 
increment MTCONS 714. Upon receiving this signal the 
RTU 264 will increment MTCONS 714 modulo the ring size 
of 4096 bytes. 

Index registers MTCONS 714 and MTRECOV 712 have 
a producer-consumer relationship. When the RTU detects a 
difference in their values, it signals to the PP that the 
associated TX ring 406 or 408 has a retired buffer to recover. 
That information is visible to the Policy Processor 244 in a 
status register in Processor Interface 206 which the Policy 
Processor 244 polls on occasion to see what work it needs 
to dispatch. Upon testing the RECOVER status for the TX 
Ring 406 or 408 and detecting that there is at least one buffer 
to recover, the Buffer Recovery code 118 reads the RTU's 
264 MTRECOV address for that ring to dequeue the buffer 
pointer from the TX ring 406 or 408. The read causes the 
RTU to return the buffer pointer referenced by MTRECOV
712, and then to increment MTRECOV 712 modulo the ring
size of 4096 bytes. The region 704 contains the buffer
pointers of buffers which have been retired by the TX MAC 
222 or 232 but have not yet been recovered by the Buffer 
Recovery code 118. 

The regions 702 and 708 are the same region, which in the
figure shown are spanning the end and the beginning of the
array 720 in memory 260 which contains the TX Ring 406
or 408. This region contains entries which are neither a
buffer pointer to a buffer ready for transmit, nor a buffer
pointer to a buffer which the TX MAC 222 or 232 has retired
but the recovery code 118 has not yet dequeued. For the
purposes of a TX Ring 406 or 408 this region consists of
space into which more packets may be scheduled for trans-
mit. One skilled in the art will recognize that region 704 or
region 706 could just as easily be the region wrapping
around the array boundary, depending on the values of
MTRECOV 712, MTCONS 714, and MTPROD 716.
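
The way the three TX ring indices partition the array, including the wrap-around case, can be sketched with modular arithmetic. This assumes 32-bit entries (1024 per 4096-byte ring) and indices counted in entries; the function name is hypothetical.

```python
ENTRIES = 4096 // 4       # assumed: 32-bit entries in a 4096-byte ring


def tx_ring_regions(mtrecov, mtcons, mtprod, entries=ENTRIES):
    """Return how many entries fall in each region of a TX ring."""
    scheduled = (mtprod - mtcons) % entries    # region 706: awaiting transmit
    retired = (mtcons - mtrecov) % entries     # region 704: awaiting recovery
    free = entries - scheduled - retired       # regions 702/708: no meaning
    return {"scheduled": scheduled, "retired": retired, "free": free}
```

With MTRECOV at 1000, MTCONS at 1020 and MTPROD at 5, the scheduled region wraps across the array boundary yet is still counted correctly as 9 entries.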

Embedded in the buffer is the packet length in bytes
(including the Ethernet header, but not including the CRC
since the TX MAC will generate that) and also the byte
offset within the buffer where the Ethernet header begins.
The offset is necessary since the start of packet might have
been moved back (if adding encapsulation headers) or
forward (if decapsulating a packet). The Ethernet header
typically starts at byte offset 0x2 within that word, but the
TX MAC supports arbitrary byte alignment. PKT_OFFSET
and LENGTH are found in the "RX Status" and "TX
Command" word of the buffer as described in Table 1; for
transmit purposes those are the only two meaningful fields
in that word.

The area labeled "TX Status/TX Timestamp" is optionally
written with one word of transmit status plus the value of
TIMER at the time the field is written, if MCSR[TX_STAT]
is set; the content of that word is described in Table 2.

FIG. 9 illustrates the transmit buffer format according to 
certain embodiments of the present invention. 

When a packet is scheduled through TX Ring 406 or 408
to be transmitted on a TX MAC 218 or 234, respectively, the
TX MAC controller 222 or 232, respectively, interprets the
contents of the packet buffer 840 in accordance with the
format shown in FIG. 9. The RX Status Word and TX
Command Word 802 is found at the location pointed to by
the 2 KB-aligned buffer pointer obtained from the TX Ring
406 or 408. The RX Status and TX Command Word 802 is
in the format specified by Table 1; when this word is
interpreted by the TX MAC controller 222 or 232 only the
fields LENGTH and PKT_OFFSET have any meaning and
the rest of the word is ignored. PKT_OFFSET indicates the
byte offset from the start of the 2 KB-aligned buffer at which
the first byte of the Ethernet header is to be found, and
LENGTH is the number of bytes to be transmitted not
including the (4-byte) Ethernet CRC which the TX MAC
222 or 232 will generate and append to the packet as it is
being transmitted. The RX Timestamp 804 was used by
previous agents processing this buffer, and is not interpreted
by the TX MAC controller 222 or 232.

The PKT_OFFSET field can legitimately have any value
between (16) and (255), allowing the agent that scheduled
the transmit to manipulate headers and to relocate the start
of the packet header 812 as needed. FIG. 9 shows a
zero-filled two-byte pad 830 prior to the start of Ether
Header 812, but that is not a requirement of the preferred
embodiment; the TX MAC 222 or 232 can transmit a packet
which starts at any arbitrary byte alignment in the transmit
buffer 840. The two-byte pad 830 shown preceding the
header 812 is shown to illustrate the common case, wherein
a received packet was thus aligned and any movement of the
Ethernet header 812 for encapsulation or decapsulation of
protocols is in units of words (4 bytes). Pad Space 810 can
vary in size from zero bytes to (240) bytes as defined by the
value of PKT_OFFSET in the TX Command Word 802.
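
Since only LENGTH and PKT_OFFSET matter to the TX MAC, an agent building a packet from scratch needs to fill in just those two fields. A minimal sketch follows, using the Table 1 bit positions and the PKT_OFFSET range stated above; the function name is hypothetical.

```python
def make_tx_command(length, pkt_offset):
    """Build a TX Command Word carrying only LENGTH and PKT_OFFSET."""
    if not 16 <= pkt_offset <= 255:
        raise ValueError("PKT_OFFSET must be between 16 and 255")
    if not 0 <= length <= 0x7FF:
        raise ValueError("LENGTH must fit in bits [26:16]")
    return (length << 16) | pkt_offset     # all other bits are ignored on TX
```

A word received from the RX MAC already has this shape, which is why a received buffer can be handed to a TX MAC unmodified.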

The concatenation of Ether Header 812 and Payload 814
comprises the packet that is transmitted, along with the
generated Ethernet CRC which the TX MAC 222 or 232
appends during transmit. The Ethernet CRC field 816 is not
normally used by the TX MAC 218 or 234, but was written
there during receive by the RX MAC 220 or 228. Each TX
MAC controller 222 and 232 has a configuration setting
which can instruct it to not generate CRC as it transmits; in
that case the LENGTH field in the TX Command Word 802
includes the four bytes of Ethernet CRC, and the data in 816
is sent with the packet for use as the packet's CRC. This
configuration which uses software-generated Ethernet CRC
is provided primarily as a diagnostic tool for sending bad
packets to other devices on the network.

Upon completion or abandonment of a transmit, the TX 
MAC will write back the TX Status Word 806 and the TX 
Timestamp 808 if it is so configured. The TX Status Word 
806 contains the information and format shown in Table 2. 
The TX Timestamp 808 is written with the value of the 
Timestamp Register 214 at the time the write to TX Times- 
tamp 808 is initiated. 
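
Buffer recovery code that checks transmit status could decode the TX Status Word along the following lines. The bit positions follow the Table 2 layout; the function name and the returned structure are assumptions made for illustration.

```python
FAILURE_FLAGS = {"LATE_COL": 30, "XS_COL": 29, "XS_DEFER": 28,
                 "UNDERFLOW": 27, "GIANT": 26}


def decode_tx_status(word):
    """Decode a 32-bit TX Status Word using the Table 2 bit positions."""
    return {
        "TX_OK": bool((word >> 31) & 1),
        "reasons": [n for n, b in FAILURE_FLAGS.items() if (word >> b) & 1],
        "COL_CNT": (word >> 22) & 0xF,     # bits [25:22]
        "TX_SIZE": word & 0x7FF,           # bits [10:0], includes the CRC
    }
```

The result is only meaningful when the MAC has been configured to write status; otherwise the word holds uninitialized data.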

The software data structure 820 which travels in the 
packet buffer 840 along with the packet is the same one 614 
discussed in the description of an RX buffer 620 as shown 
in FIG. 7, and may be relocated by software convention as 
described in the discussion of FIG. 7.

The transmit status word 806 contains a flag indicating if
the transmission was successful, and the reason for failure if
the transmit was abandoned. This field is written only if
MCSR[TX_STAT] is set; otherwise the fields 806 and 808
contain uninitialized data.

TABLE 2

Ethernet TX Status Word

Bits      Field          Description

[31]      TX_OK          Packet was successfully transmitted.
[30]      LATE_COL       Transmit abandoned due to a late collision (only
                         if (MCSR[LATE_COL_RTRY]==0)).
[29]      XS_COL         Transmit abandoned due to excessive collisions
                         (16 collisions).
[28]      XS_DEFER       Transmit abandoned due to excessive deferrals.
[27]      UNDERFLOW      Transmit abandoned due to slow memory response
                         times.
[26]      GIANT          Packet length was larger than legal.
[25:22]   COL_CNT[3:0]   Number of collisions experienced (never shows
                         more than 15; if XS_COL this value is 'x').
[21:11]   reserved       MAC writes 0x0 to this field.
[10:0]    TX_SIZE[10:0]  Number of bytes transmitted (includes the 4-byte
                         Ethernet CRC).

There are 5 possible transmit packet sources sharing the TX
MAC; these are

The RISC processor (Policy Processor) generating or
forwarding a packet

Crypto generating a modified packet

The AP either creating, forwarding, or modifying a packet

A device in a PCI expansion slot creating, forwarding, or
modifying a packet

A peer PE forwarding a packet to a different network
segment (e.g. for routing or switching)

Atomic enqueueing by multiple sources is supported via
writes to RTU[MTPROD] associated with that MAC's
Transmit Ring. The RTU can detect high-water-mark con-
ditions and signal the situation to the PP and the AP. The
MTCONS index pointer is incremented by the MAC when-
ever a buffer is retired; that is chased by another consume
pointer incremented by reads of RTU[MTRECOV] which is
used by the PP for recovery of retired packet buffers to the
buffer pool and (optionally) checking TX status.

4. Reclassify Rings

The Classification Engine receives packets to classify
from both the RX MAC (via the RX Ring), and from other
sources (PP, AP, Crypto, and potentially other network cards
on the PCI bus). A second input ring (Reclassify Ring) is
provided for each CE for these other sources to schedule a
packet for classification on that CE; each comprises a ring in
memory with enqueue and dequeue operations supported
through the RTU. The 32-bit entries in the ring are buffer
pointers.

FIG. 10 shows the reclassify ring structure.

The Reclassify Rings 410, 412, 414, and 416 serve a very
similar purpose to the RX Rings 402 and 404, and have
substantially the same structure. The substantive differences
are that there is one less interim consumer-producer in the
Reclassify Rings, and that packets get scheduled through the
Reclassify Rings via a different path. Reclassify Rings 410,
412, 414, and 416 are used to schedule packets for process-
ing on CE 238, 208, 242, and 212 respectively.

In the case of the RX Ring 402 or 404, buffer pointers are
enqueued by the Buffer Allocation process 102 running on
the Policy Processor 244 using MPROD 518, which allo-
cates the referenced buffers as free and empty for the RX
MAC 220 or 228, respectively, to consume using MFILL
516 when receiving a packet and to produce a full, unclas-
sified buffer to the CE 238 or 242, respectively. Packets
scheduled for classification via the Reclassify Rings 410,
412, 414, and 416 come from a source other than the RX
MACs 220 or 228, as illustrated in FIG. 2. Full, unclassified
buffers get scheduled onto one of the Reclassify Rings when
an agent enqueues the buffer pointer onto the ring by writing
the buffer pointer to the RTU's 264 enqueue address, which
causes the RTU 264 to write the buffer pointer to the location
in memory 260 referenced by RPROD 916 and then to
increment RPROD 916 modulo the ring size of 4096 bytes.

From that point onward the description is substantially the 
same as the description of the RX Ring 402 and 404, except 
that RCCONS 914 is used in place of MCCONS 514, 
RPCONS 912 is used in place of MPCONS 512, the invalid 
region 902 and 908 substitutes for 500 and 508, Full and 
Classified 904 substitutes for 502, and Full Unclassified 906 
replaces 504. Since this flow has no allocation of empty 
buffers there is no equivalent to MFILL 516 nor to Valid 
Empty 506. 

Note that the "Outbound" classifiers 208 and 212 each 
have only a Reclassify Ring 412 and 416, respectively, but 
no RX Ring since they are not associated with an RX MAC. 

5. Crypto Command Queue and General Purpose 
Communication Rings 

In order to schedule buffers for processing by the external
(and optional) encryption engine another memory-based
ring containing buffer pointers is implemented, with
enqueue and dequeue operations supported through the RTU
for the Crypto unit to get the next buffer to process, plus a
status bit indicating to Crypto that there is at least one packet
buffer pointer in the ring to process. The information about
what operations to perform, keys, etc. are embedded in a
Crypto Command Descriptor in the software area of the
buffer.

FIG. 11 shows the Crypto Ring and COM[4:0] Ring
Structures.

The Crypto Ring 420, COM0 Ring 422, COM1 Ring 424,
COM2 Ring 426, COM3 Ring 428, and COM4 Ring 430 are
identical in structure. Any agent can enqueue a buffer
pointer, or in the case of the COM Rings, any 32-bit datum,
by writing to the RTU's 264 enqueue address associated
with the particular ring. This causes the RTU to store the
buffer pointer or 32-bit datum to the location in memory 260 
referenced by the specified PRODUCE Pointer 1010 and 
then to increment PRODUCE 1010 modulo the ring size of 
4096 bytes. There is a producer-consumer relationship 
between a particular ring's PRODUCE pointer 1010 and that 
ring's CONSUME pointer 1008. When the RTU detects a 
difference between the values of PRODUCE 1010 and 
CONSUME 1008 it signals to the consuming unit that there 
is at least one entry to be consumed. 
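
The enqueue behavior just described, together with the matching dequeue behavior described in the next paragraph, amounts to a simple FIFO with a not-empty status. The sketch below is an illustrative software model that assumes 32-bit entries and entry-counted indices, and ignores ring-full conditions; the class name is hypothetical.

```python
ENTRIES = 4096 // 4       # assumed: 32-bit entries in a 4096-byte ring


class ComRing:
    """Toy model of a Crypto or COM ring with two indices."""

    def __init__(self):
        self.slots = [0] * ENTRIES
        self.produce = 0      # PRODUCE pointer 1010
        self.consume = 0      # CONSUME pointer 1008

    def not_empty(self):
        """Status the RTU presents to the consuming unit."""
        return self.produce != self.consume

    def enqueue(self, datum):
        """Any agent stores a 32-bit datum and advances PRODUCE."""
        self.slots[self.produce] = datum & 0xFFFFFFFF
        self.produce = (self.produce + 1) % ENTRIES

    def dequeue(self):
        """The consumer reads the oldest datum and advances CONSUME."""
        if not self.not_empty():
            raise IndexError("ring is empty")
        datum = self.slots[self.consume]
        self.consume = (self.consume + 1) % ENTRIES
        return datum
```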

The consumer dequeues a 32-bit entry from one of these
rings by reading from the RTU's dequeue address associated
with that particular ring; this causes the RTU to return the
data at the address in memory 260 referenced by that
CONSUME pointer 1008 and then to increment CONSUME
1008 modulo the ring size of 4096 bytes. As is illustrated
here, the degenerate case of the multiple-producer, multiple-
consumer ring structure described in FIGS. 6, 8 and 10 is a
single-producer, single-consumer FIFO with fifo-not-empty
status presented to the consumer. The COM rings 422, 424,
426 and 428 all report ring-not-empty status and
(programmably per ring) either near-full or near-empty
threshold status to the Policy Processor 244 through status
registers in the processor interface 206. These rings can be
assigned for any purpose; anticipated uses include a
message-in ring for the Policy Processor 244, a ring for
allocating buffers for use by remote agents, and a ring for
allocating DMA descriptors for use by remote agents sched-
uling this Policy Engine's DMA Unit 210.

The Crypto Ring 420 reports ring-not-empty status to the
Crypto Processor 246 through a status register in Crypto
Interface 202. COM4 430 also reports ring-not-empty status
through a similar location, so that COM4 430 can optionally
be used to support scheduling packets for processing by a
second Crypto Processor 246. The Crypto Processor Inter-
face 202 has additional support for a second Crypto Pro-
cessor 246, which might be added to provide either more
bandwidth for encryption processing or additional function-
ality such as compression. Packets would be scheduled for
processing on this second processor 246 by enqueueing their
buffer pointers onto COM4 430. Alternatively, both the
Crypto Ring 420 and COM4 430 can be used to schedule
buffers for processing on the one Crypto processor 246.

The general purpose communication rings COM[4:0] 422,
424, 426, 428, and 430 are identical in structure to the Crypto
Ring 420.

6. DMA Command Queue and Descriptors

The DMA engine also uses a ring unit with an Enqueue
register for any agent to schedule DMA transfers (DMA_
PROD), a Consume register for the DMA engine to get
entries from the ring (DMA_CONS), and a Dequeue reg-
ister for recovering retired descriptors (and the associated
buffers) from the ring (DMA_RECOV).

The DMA engine is used to move data between the
memory and the PCI bus; the source/target on PCI can be
host (AP) memory or another PCI device. DMA operations
are scheduled by creating a 16-byte descriptor in memory
and then enqueueing the address of that descriptor in the
DMA engine's command ring by writing it to DMA_
PROD. The PP, the host, a PCI bus peer, and Crypto can
atomically schedule use of this engine.

DMA is notified by the RTU when the Produce pointer is
not equal to the Consume pointer and processes the next
descriptor. When that descriptor is retired, DMA increments
the Consume pointer; a delta between that and the Recover
pointer causes the RTU to signal to the PP that there are
DMA descriptors (and the associated buffer pointers) to
recover.

TABLE 3

DMA Descriptor Format

PCI_Address[31:00]
FLAGS[31:0]
S1[31:27]    Buf_Address[26:11]       S2[10:0] (pointer tag field)
S3[15:11]    Buf_Start_Index[10:2]    0b00    Word_Count[15:0]

The areas labeled "S2" and "S3" are available for software
use. "S1" is reserved for future expansion of PE memory
size.

Upon completion of a transfer, the DMA engine can
optionally set a completion status bit in either the Host
Interrupt Register or Processor Interrupt Status Register in
case the initializing agent wants completion status of a
transfer or group of transfers. 8 bits are provided in each so
that transfers can be tagged as desired. This allows both AP
and PP software to have up to 8 DMA completion events
scheduled at one time for tracking when particular groups of
transfers have completed, or for the PP to signal to the AP
that information has been pushed up to a mailbox or com-
munication ring in AP memory, or for similar signals from
the AP to the PP.

The Packet Buffer Address field contains the packet buffer
pointer in the same format that is used by all other agents in
the Policy Engine; this means that bits [10:0] are ignored by
hardware and might contain tag information. The actual
memory word address is the concatenation of the 2
KB-aligned Packet_Buffer_Address[31:11] with Start_
Index[10:2], with 00 in the lower two bits. Note that the
Word Count allows for a maximum DMA transfer of (64
K-1 Words, or 256 K-4 Bytes), in case there are transfers
larger than normal packet buffer movement (e.g. moving
down PP code or CE microcode).
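
The address concatenation and the transfer-size limit can be worked through as follows. This sketch assumes a 32-bit buffer pointer whose tag bits [10:0] are discarded, a 9-bit Start_Index, and a 16-bit Word_Count; the function names are hypothetical.

```python
TAG_MASK = 0x7FF                 # bits [10:0] of a buffer pointer
MAX_WORDS = (1 << 16) - 1        # largest value of a 16-bit Word_Count


def dma_start_address(buffer_pointer, start_index):
    """Byte address of the first word moved; start_index is Start_Index[10:2]."""
    if not 0 <= start_index < 512:
        raise ValueError("start_index must fit in 9 bits")
    base = buffer_pointer & ~TAG_MASK & 0xFFFFFFFF   # tag bits are ignored
    return base | (start_index << 2)                 # low two bits stay 00


def max_transfer_bytes():
    """64 K-1 words of 4 bytes each, i.e. 256 K-4 bytes."""
    return MAX_WORDS * 4
```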

The Flags word contains the following fields: 






are dequeued by the Crypto Processor which will then 
enqueue them on the specified destination ring after pro-
cessing. Buffers that are scheduled for DMA are recovered
at the same time the associated DMA descriptor is recovered 
from the ring. Buffers may be temporarily absorbed by an 



TABLE 3a 



DMA Descriptor "Flaes" Word 



Bits Field 



Descriptions 



[31:21] SOFTtlOiO] 
[20] TO_MEM 
[19:16] PCI_CMD[3:0J 



[15:08] SET_HISR[7:0] 
[07:00] SET„PISR[7:0] 



Available for software use. 

Direction: 1 Tb Memory (From PCI), 0 «» From Memory (To PCI) 

This is the PCI command code which is used on the PCI bus for these transactions; the 

most 

common codes will be 0x7 (Memory Write) and 0x6 (Memory Read) with some probability 
of also using OxC (Memory Read Multiple) and OxE (Memory Read Line) if the attached 
host uses them for prefetch directives. 

Any bit that is set will set the corresponding status bit in the HISR upon retirement of this 
descriptor. If no bit is set, no status is sent to HISR. 

Any bit that is set will set the corresponding status bit in the PISR upon retirement of this 
descriptor. If no bit is set, no status is sent to PISR. 



Since DMA descriptors are read from memory by the DMA 
engine, software must ensure either that the descriptors were 
non-cacheable by the processor, or that they are flushed from 
the PP cache prior to writing the descriptor's address to the 
DMA ring. 

For descriptors that are generated by the AP of by a PCI peer 
see "Endianness" in section 8 for details about descriptor 
endianness. 

FIG. 12 shows the DMA Ring Structure. 

The DMA Ring 418 is substantially the same as the TX 
Rings 406 and 408 as described in FIG. 8. There is a single 
enqueue index DMA_PROD 1116 used to schedule pointers 
on the ring 418 by any agent, and interim consumer- 
producer index DMA_CONS 1114 used by the DMA Unit 
120 to consume newly scheduled descriptor pointers and to 
produce retired descriptor pointers, and a dequeue index 
DMA_RECOV 1112 used by the Policy Processor 244 to 
recover pointers, and a dequeue index DMA_RECO V 1112 • 
used by the Policy Processor 244 to recover retired descrip- 
tors as well as the buffers associated with them using the 
buffer pointer embedded in the DMA descriptor being 
recovered. Differences between DMA^_PROD 1116 and 
DMS_CONS 1114 are detected by the RTU 264 and 
reported to the DMA Unit 120. Differences between DMA_ 
CONS 1114 and DMA_RECOV 1112 are reported by the 
RTU 264 to the Policy Processor 244 through a status bit in 
the Processor Interface 206. Region 1106 contains one or 
more descriptor pointers which point to DMA descriptors as 
described in Table 3. Region 1104 contains descriptor point- 
ers of descriptors which have been retired by DMA 120 but 
have not yet been removed by Buffer Recovery 118. Invalid 
1102 and 1108 are the unused space into which more 
pointers can be scheduled. 

7. Buffer Allocation/Flow 

At initialization time the software allocates a pool of 
size-aligned 2 KB buffers in memory. Enough of these are 
allocated to each of the RX rings (that is, the buffer pointers 
are enqueued on those rings by writing them to the associ- 
ated RTU[MPROD]) to provide the desired elasticity for the 
RX MAC, and the rest are placed on a freelist (e.g. on a 
software-managed linked list.) Each time the PP dequeues a 
buffer from the RX ring it can allocate a new empty buffer 
from the freelist, thus keeping the pool size constant. Buffers 
that go through Crypto may be enqueued by any agent and 



application if it is queueing packets for delay. A reference 
count can be maintained in buffers which go to multiple 
readers so that they retire only when all readers have retired 
them. 
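
The reference-count idea above can be illustrated with a short Python sketch. The class and function names are hypothetical and the freelist is modeled as a plain list; the text only says that a count "can be maintained" in the buffer, so the exact bookkeeping shown here is an assumption.

```python
class SharedBuffer:
    """A buffer handed to several readers; all names here are illustrative."""

    def __init__(self, pointer: int, readers: int):
        self.pointer = pointer    # 2 KB-aligned buffer pointer
        self.refcount = readers   # number of readers that still own the buffer

    def retire(self) -> bool:
        """Drop one reference; return True when the last reader has retired."""
        if self.refcount <= 0:
            raise ValueError("buffer already fully retired")
        self.refcount -= 1
        return self.refcount == 0


def retire_reader(buf: SharedBuffer, freelist: list) -> bool:
    """Return the buffer to the freelist only after every reader is done."""
    if buf.retire():
        freelist.append(buf.pointer)
        return True
    return False
```

With two readers, the first call to `retire_reader` leaves the freelist untouched and the second call returns the pointer to it, so no copy of the buffer is ever needed.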

The goal is that the PP can handle buffer allocation and 
recovery through the read of status bits in the PISR, reads of 
RTU recover or dequeue addresses to recover retired buffers
when the RTU indicates through the PISR that the particular 
rings have buffers to recover, and writes to ring RTU 
enqueue addresses to allocate new buffers. It is a primary 
goal that copying of buffers is avoided except when abso- 
lutely necessary. 

Rings report threshold warnings to the PP/AP through the 
CRISIS register when there is danger of under/overflowing 
(within Va ring-size of a problem situation) and also report 
full/empty status of rings through bits in the CRISIS Reg- 
ister as appropriate. 

7.1. The Life of an RX Packet Buffer 
Ideally, a packet arrives into a buffer, gets processed, and
then gets transmitted out the other port or gets dropped.
Processing may include a decision by the application to 
enqueue the buffer for temporary delay (and possible later 
dropping), to feed a packet through the local optional Crypto 
for encryption work, or to pass a packet to the AP or external 
coprocessor (see FIG. 4). The key concept is to think of a 
packet as being "owned" by some agent, and that agent 
taking responsibility for the final disposition of the packet. 

7.2. Flow of a Buffer Which Remains Local 

At the beginning of time the system allocates a number of 
buffers to an RX MAC by writing their pointers into that RX 
Ring's RTU[MPROD] enqueue register, which presents 
these buffers to that MAC as empty/allocated. These buffers 
are now owned by that RX MAC, and cannot be touched by 
others until the MAC has so indicated. When the RX MAC 
has filled a buffer with a newly received packet it passes 
ownership to the associated Classification Engine by moving
the MFILL pointer to the next entry (buffer pointer) in
the ring. The CE will detect this and then process that packet;
when it is done it passes ownership to the PP by incrementing
the MCCONS index modulo ring size, and then the
application(s) running on the PP will determine what
action(s) to take. Ownership of a buffer is always explicitly
relinquished by the current owner. 
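
The hand-off of ownership through ring indexes can be modeled with a small Python class. This is a toy sketch, not the hardware: MPROD, MFILL and MCCONS are the index names used in the text, while `recov`, the method names and the "ring full" rule are assumptions made for the example.

```python
class RxRing:
    """Toy model of one RX ring and the agents that own its entries."""

    def __init__(self, size: int):
        self.size = size
        self.slots = [None] * size
        self.mprod = 0    # software enqueue index (RTU[MPROD])
        self.mfill = 0    # advanced by the RX MAC after it fills a buffer
        self.mccons = 0   # advanced by the CE after it classifies a packet
        self.recov = 0    # hypothetical name for the PP dequeue index

    def _span(self, start: int, end: int) -> int:
        return (end - start) % self.size

    def enqueue_empty(self, pointer: int) -> None:
        """Software gives an empty buffer to the MAC."""
        if self._span(self.recov, self.mprod) == self.size - 1:
            raise OverflowError("ring full")
        self.slots[self.mprod] = pointer
        self.mprod = (self.mprod + 1) % self.size

    def mac_fill(self) -> None:
        """The MAC has filled a buffer and passes it to the CE."""
        if self.mfill == self.mprod:
            raise RuntimeError("no empty buffer owned by the MAC")
        self.mfill = (self.mfill + 1) % self.size

    def ce_done(self) -> None:
        """The CE has classified a packet and passes it to the PP."""
        if self.mccons == self.mfill:
            raise RuntimeError("no filled buffer owned by the CE")
        self.mccons = (self.mccons + 1) % self.size

    def pp_dequeue(self):
        """The PP takes the next classified buffer, or None if there is none."""
        if self.recov == self.mccons:
            return None
        pointer = self.slots[self.recov]
        self.recov = (self.recov + 1) % self.size
        return pointer

    def owned(self) -> dict:
        """Number of ring entries currently owned by each agent."""
        return {
            "MAC": self._span(self.mfill, self.mprod),
            "CE": self._span(self.mccons, self.mfill),
            "PP": self._span(self.recov, self.mccons),
        }
```

Each index only ever moves forward, and the difference between two neighboring indexes is exactly the set of buffers owned by one agent, which is why no locking is needed between the agents in this model.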
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The PP can perform any conventional actions with a
buffer. Examples of actions for a buffer which remains
entirely local are DROP, FORWARD, MODIFY or temporarily
ENQUEUE then later FORWARD.

DROP

The code running on the PP determines that there are no
further uses for the contents of this buffer, so it retires/
recovers the buffer. Typically this occurs when the Action
portion of the application(s) running on the PP decides that a
packet does not meet the criteria for passing it forward.

FORWARD

The PP enqueues the pointer onto the appropriate TX ring;
TX is fire-and-forget (with optional completion status from
the MAC), with the hardware responsible for either
completing or abandoning the transmit (that is, the TX MAC
owns that buffer). Some time later in the buffer reclamation
code, the PP will recognize that the TX MAC has retired this
packet (is done with it) since the RTU indicates that there is
a delta between MTCONS and MTRECOV; thus ownership
of that buffer has transferred back to the PP. The PP then
checks TX completion status (if the application(s) care) and
recovers the buffer or reschedules the transmit as
appropriate.

MODIFY

The application may choose to send the packet through
Crypto for processing, may encapsulate/decapsulate the
packet, could do address translation, or can do any other
modification of the packet that the application directs.

ENQUEUE

The application running on the PP determines that it wants
to hold on to the packet for some period of time, after which
it will either forward or drop it. Ownership of that buffer
stays with the application until it relinquishes it by
enqueueing the buffer's pointer on the appropriate TX or
Reclassify ring, or by deciding to DROP it, in which case the
same path as DROP (above) is followed. In the ENQUEUE case
the average residency of a packet in a memory buffer is much
longer than in the simple DROP or FORWARD cases, so if
applications are enqueueing packets then care must be taken
to allocate a large enough buffer pool.

7.3. Buffer Handling for Packets Sent to the PCI
Bus

The application(s) on the PP may decide that a packet
[...]
or because the packet is actually targeted at the AP as the
final destination. In either case it is necessary to migrate the
packet to buffers in the AP's memory (e.g. into mbufs in the
stack running there or into application-specific storage). The
buffer itself is not migrated; some or all of its contents are
copied to a different buffer in host memory. This is done using
the DMA engine.

Alternatively the application could choose to store the
packet locally (that is, maintain ownership of the buffer) and
simply pass a pointer and other information up to the AP. In
this case the PP cannot reclaim the buffer until the AP has
informed the PP that ownership of the buffer has been
released back to the PP.

Other reasons for sending packets up to the PCI bus
include a push-model peer-to-peer copy to a different Policy
Engine or external coprocessor, and logging of selected
packets at the AP. The latter is interesting because it may
involve a fork where a packet takes two paths: one to a MAC
transmit queue, and a second to the PCI bus; reclamation of
that buffer would require a convergence of completion, that
is, a "join" function before the buffer can be reclaimed (if
copying is to be avoided). Software can maintain a reference
count in the buffer for this purpose.

Forwarding a packet to the AP can be in the guise of
NIC-like behavior or for application-specific
communication. In either case the packet's buffer pointer is
written to a DMA descriptor as the MEM ADDR, and after the
rest of the DMA descriptor is created the pointer to that
descriptor is enqueued on the DMA engine's command queue.
As with all other queues described so far, the PP has a trailing
recover pointer DMA_RECOV and receives status in the
PISR from the RTU when there are retired descriptors to
recover.

The "NIC" interface as seen in host memory can be
arbitrarily complex, but can be as simple as a memory image
of a buffer pool and pointer ring with a produce
and a consume pointer, all in host memory; the "RX NIC
interface" can mean reading a pointer to a free buffer,
DMA'ing the entire packet buffer to that location, following
that with a DMA of a new value to the "Produce" pointer
associated with it, and an interrupt to the host (using one of
the bits HISR[DMA_DONE[7:0]]) upon completion of that
DMA. More efficient host structures can be implemented
without much more complexity. Communication down from
the AP can also use the DMA engine and can involve a
similar software ring structure in either host or PE memory;
messages and/or ring indexes are written by the AP into one
of the 16 Mailbox locations provided, which write data to PE
memory and set a per-mailbox status bit which signals
mailbox status through the PISR to the PP.

A peer-to-peer routing operation with a push model might
require a buffer pool in PE memory to be allocated for each
peer that will be doing this; then sending a packet to another
Policy Engine for transmit is as simple as scheduling a DMA
to copy the data from the local buffer to a buffer in this PE's
buffer pool on the remote PE, followed by a DMA of the
pointer to that buffer (in the "local" pointer format) into
RTU[MTPROD] to schedule it for transmit. Later the remote
PP will reclaim the buffer some time after the transmit is
done, and will send back the pointer (or a "credit" message)
by DMA'ing it to this PP's "freelist" ring for that particular
peer.

Another more general method of allocating buffers and
DMA descriptors to remote masters is to assign one of the
general-purpose COM rings to contain a freelist of buffer
pointers and a second to contain a freelist of DMA
descriptor pointers [...]
and a DMA descriptor for scheduling a fill of that buffer,
[...]

A pull model of communication would have the remote
master send only a (PCI) pointer or a descriptor down
through either a mailbox or a COM ring allocated for this
function, and requires the PP to select a buffer from its own
pool of buffers allocated for this purpose, using DMA to
copy the buffer from the remote memory into local memory,
[...] Ownership of the actual buffer in this case [...]
to the PP.

7.4. Placement of the Software Structure in the
Buffer

While the hardware defines the location of the receive and
transmit control and status words and the location of the
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packet in the packet buffer, it is only by convention that the 
software structure resides forward from the 2 KB -aligned 
buffer pointer. A different convention can be used where the 
software structure of N bytes actually begins N bytes before 
the 2 KB-aligned buffer pointer, in this case the buffers 
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activity, to PIO access from the PCIbus to/from memory, and 
also reads and writes from PCI through the Ring Translation 
Unit; the rings are simply memory with fancy address 
translation. 



TABLE 4

Byte Lane Steering, PCI64-to-Memory

(byte 7)    (byte 6)    (byte 5)    (byte 4)    (byte 3)    (byte 2)    (byte 1)   (byte 0)
PCI[63:56]  PCI[55:48]  PCI[47:40]  PCI[39:32]  PCI[31:24]  PCI[23:16]  PCI[15:8]  PCI[7:0]
M[7:0]      M[15:8]     M[23:16]    M[31:24]    M[39:32]    M[47:40]    M[55:48]   M[63:56]




managed and allocated by software are actually (2 KB - N)-
byte aligned, and the RX status word is placed N bytes into
the buffer, which lands it precisely on the 2 KB-aligned word
where it already goes. Hardware doesn't know the difference,
but software can take advantage of such a structure to allow
for arbitrary-sized packets from any media, which start
forward from the RX status word just like the Ethernet
packet but may occupy contiguous memory far bigger than
an Ethernet packet would. By placing the software structure
before the RX status word, the structure does not have to be 
moved to accommodate larger packets. 
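
The alternative placement convention amounts to a small piece of pointer arithmetic, sketched below in Python. The function names are hypothetical, and the limit of N to less than 2 KB is an assumption made for the example rather than something the text states.

```python
BUFFER_ALIGN = 2048  # hardware buffer pointers are 2 KB-aligned


def software_struct_base(hw_pointer: int, struct_size: int) -> int:
    """Address where an N-byte software structure starts when it is placed
    immediately before the 2 KB-aligned hardware buffer pointer."""
    if hw_pointer % BUFFER_ALIGN != 0:
        raise ValueError("hardware buffer pointer must be 2 KB-aligned")
    if not 0 <= struct_size < BUFFER_ALIGN:
        raise ValueError("example assumes 0 <= N < 2 KB")
    return hw_pointer - struct_size


def rx_status_address(base: int, struct_size: int) -> int:
    """The RX status word sits N bytes into the software-managed buffer."""
    return base + struct_size
```

For a 64-byte structure and a hardware pointer of 0x4000, the software-managed buffer starts at 0x3FC0, which is (2 KB - 64)-byte aligned, and the RX status word still lands on 0x4000.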

8. Endianness

8.1. Overview 

Internal to the Policy Engine ASIC, all agents are big- 
endian. This includes the MACs, memory, the CEs, the 
Policy Processor, the Crypto port, and the DMA engine 
descriptor format. This choice is most convenient for dealing 
with protocol headers, which are typically big-endian native. 
The CE itself has no endianness since it works only in units 
of =bits throughout; however, it does deal with multibyte
data in the way those words are formatted in memory, thus 
it sees the big-endian layout of the packet buffer contents 
and also writes its status words and hash pointers in big- 
endian format, which is what the PP expects to see. 

All PIO accesses from PCI to registers (PCI address range 
recognized by BAR1) are required to be 32-bit access only. 
The registers connect to the PCI bus so that bit <0> of the
host CPU register is bit <0> of the PE register, and bit <31>
corresponds to bit <31>. This implies that bit <0> of a
register access travels on bit <0> of the PCI bus. Registers are
placed on doubleword boundaries but are accessed as words,
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TABLE 5

Byte Lane Steering, PCI32-to-Mem

                    (byte 3)    (byte 2)    (byte 1)   (byte 0)
                    PCI[31:24]  PCI[23:16]  PCI[15:8]  PCI[7:0]
First data phase    M[39:32]    M[47:40]    M[55:48]   M[63:56]
(or word at 0x0)
Second data phase   M[7:0]      M[15:8]     M[23:16]   M[31:24]
(or word at 0x4)




This byte-lane steering has some interesting implications
that need to be understood so that it is clear when software
will have to twist data. Four interesting cases will be
examined: (a) the host writing a DMA descriptor into
memory for the DMA engine to consume, (b) the host
writing a message to the PP in memory, (c) the PP writing
a message in memory that is DMA'ed to host memory, and
(d) issues surrounding loading of CMEM in the four CEs.
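
The effect of byte-lane steering on multi-byte data can be reproduced with a short Python model. It assumes a little-endian host and treats PE memory as a byte array read big-endian by the PE agents; the function names are hypothetical, and the model covers only the 32-bit word case.

```python
import struct


def pci_write_word(memory: bytearray, offset: int, value: int) -> None:
    """Model a 32-bit write from a little-endian host: PCI byte lane k
    lands at memory byte offset + k (byte-lane steering)."""
    memory[offset:offset + 4] = struct.pack("<I", value)


def pe_read_word(memory: bytearray, offset: int) -> int:
    """Model a big-endian PE agent reading the same four bytes as a word."""
    return struct.unpack_from(">I", memory, offset)[0]


def bswap32(value: int) -> int:
    """Byte-swap a 32-bit word, the 'twist' that software applies."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")
```

Writing 0xDEADBEEF without a twist makes the PE read 0xEFBEADDE, while writing `bswap32(value)` makes the PE read the intended value. A pure byte stream needs no twist because each byte keeps its own address.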

8.2 Host Writing a DMA Descriptor in Memory 



The DMA descriptor is not a byte stream; therefore the
endian-neutral PIO from the host to memory is not sufficient.
The DMA engine sees the descriptor as a 16-byte-aligned
big-endian data structure as shown in Table 3 on page 22.
For this example the fields are simplified into a 32-bit PCI
address PA, a 32-bit Buffer Address BA, a 16-bit offset OF,
a 16-bit Word Count WC, and a 32-bit Flag word F.

Here is the big-endian view of that descriptor as it appears 
in memory and as the DMA engine interprets it: 



TABLE 6

DMA Descriptor Byte Order, big-endian memory

(byte 0)   (byte 1)   (byte 2)   (byte 3)   (byte 4)   (byte 5)   (byte 6)   (byte 7)
PA[31:24]  PA[23:16]  PA[15:08]  PA[07:00]  F[31:24]   F[23:16]   F[15:08]   F[07:00]
BA[31:24]  BA[23:16]  BA[15:08]  BA[07:00]  OF[15:08]  OF[07:00]  WC[15:08]  WC[07:00]



and the data travels on bits <31:0> of the PCI bus even if the
bus is connecting 64-bit agents. As word-only entities the
registers have no byte order issue. The same is true of PCI 
Configuration Register accesses. 

All transfers between memory and the PCI bus move data
by byte lane; this means that byte <0> in memory travels on
byte <0> on the PCI bus, byte <1> on byte <1>, etc. This is
endian-neutral for byte streams. This applies to all DMA



Assuming that the host (AP) will write to this data 
structure in PE memory using word PIO's over PCI (for the 
example shown), the host must pre-scramble those words so 
that the data will arrive in the correct byte lanes: 
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TABLE 7

DMA Descriptor Byte Order, little-endian register

                    (byte 3)    (byte 2)    (byte 1)    (byte 0)
First data phase    PA[07:00]   PA[15:08]   PA[23:16]   PA[31:24]
(word at 0x0)
Second data phase   F[07:00]    F[15:08]    F[23:16]    F[31:24]
(word at 0x4)
Third data phase    BA[07:00]   BA[15:08]   BA[23:16]   BA[31:24]
(word at 0x8)
Fourth data phase   WC[07:00]   WC[15:08]   OF[07:00]   OF[15:08]
(word at 0xC)


and then when the host writes the address of the descriptor
into the DMA ring (which is "byte-lane" memory), that
descriptor pointer is written as a word with the following 
content: 
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
TABLE 8

Descriptor Pointer Byte Order, little-endian register

(byte 3)        (byte 2)        (byte 1)        (byte 0)
DESC_A[07:00]   DESC_A[15:08]   DESC_A[23:16]   DESC_A[31:24]

Note that reads and writes through the ring unit are accesses
to memory, not to registers, which is why the address
shuffle (where "the address" is data, as above) is required
when the host is writing to the ring-enqueue address.

8.3 Host Writing a Message to the PP in Memory

The PP views the memory as big-endian in the same
manner as the DMA engine, so the example in 7.8.2
describes this path as well. Messages are either a byte
stream, or require the host to manually byte swap larger data.
The contents of a mailbox and the contents of any ring entry
or other item in memory will follow the same format as
shown in Table 8.

8.4 PP Writing a Message in Memory that is
DMA'ed to the Host

If messages sent up to the host are simply a byte stream
then there is no issue, since byte streams travel in an
endian-neutral way. If on the other hand the message
includes data that are larger than a byte (e.g. a buffer
pointer), byte swapping occurs and both ends of the
communication must be aware of this.

For example, if the PP wants to send a 32-bit address to
the host, it must byte swap within that word before sending
it. That is, if the PP wants to send the 32-bit word
0xDEADBEEF up to the host as a message, then the PP must
put it into memory as 0xEFBEADDE (see Table 5).

8.5 Classification Engine CMEM Fills

Writing instructions into CMEM in the Classification
Engine takes one of two paths; the data is either DMA'ed or
PIO'ed into PE memory from the host and then copied from
memory to CMEM by the CE (using the CE's FILL_DMA
unit), or the host can PIO data directly into CMEM over the
Register interface (CMEM_DIAG access).

The CMEM_DIAG path is word-oriented and no twisting
occurs, since it is all via the register path. The 32-bit data and
addresses seen in the host processor is the same 32-bit data
that is seen in the AP's registers. Diagnostic PIO's of data
are sent to CMEM in the order [Least Significant Word, then
Most Significant Word] to construct the 64-bit instruction.

The FILL_DMA path takes 64-bit words from PE
memory and writes them into the 64-bit CMEM. The
compiler and host software always handle 64-bit instructions
in their native (that is, readable) form. CMEM instructions
are laid out as native 64-bit units in host memory; the
host/compiler does not need to twist them to help the
(other-endian) recipient. When the data arrives in PE
memory, each 64-bit instruction will arrive byte-swapped
due to byte-lane steering; that is, the instruction

0xAABBCCDD_EEFF0123

in host memory will land in PE memory as

0x2301FFEE_DDCCBBAA

and the CE CMEM Fill data path is wired as shown in Table
4, so that the bytes land in the correct place. Thus the MSB
from PE memory will go to the LSB in CMEM, and vice
versa. This works whether the data arrived in PE memory via
a PIO from the AP or via a DMA from host memory prior
to the FILL_DMA transfer into CMEM.

The upshot of all this is that the CMEM_FILL DMA unit
views PE memory as little-endian; and it doesn't matter to
anyone using normal paths that CMEM microcode images
are byte-swapped while they reside in the staging area in PE
memory. This is all hidden from software.

Classification Engine

The Classification Engine (CE) is a microprogrammed
processor designed to accelerate predicate analysis in
network infrastructure applications. The primary functions
commonly used in predicate analysis include parsing layers
of successively encapsulated headers, table lookups, and
checksum verification.

Header parsing consists of extracting arbitrary single- or
multiple-bit fields from those headers, comparing those
fields to one or more [...], then taking the results of
these comparisons and doing boolean reductions on multiple
extraction results to reduce them finally to a single
"matches/doesn't-match" status for each complex predicate
statement; this single boolean value can then be used to
quickly dispatch the appropriate actions at the PP. The size
of each header is also determined so that the next level of
protocol can be found and parsed in sequence. Applications
can also choose to examine packet contents in addition to the
headers if desired; the CE does not treat the header portion
of a packet any differently from the payload portion.

Table lookups can consist of comparing an extracted
value against a table of constants, or can involve generating
a hash key from extracted values and then doing a lookup in
a hash table (content-addressable table) to identify a record
associated with packets matching that key; the record can
contain arbitrary application-specific information such as
permissions, counters, encryption context, etc.

Checksum verification involves arithmetic functions
across protocol headers and/or packet payloads to determine
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if the packet contents are valid and thus comprise a valid
packet. A special adder parallel to the mask-rotate unit called
split-add adds the upper and lower half of a 32-bit operand
together and produces a 17-bit result for use as an operand
by the ALU; this is used in TCP, UDP, and IP checksum
computation.

Since one purpose of the CE is to help the PP to avoid
needing to touch packet contents and thus fault portions of
the packet into the PP's data cache, the CE can also be
programmed to extract arbitrary data fields and optionally do
computations on them, then pass the results to the
applications running on the PP via the packet buffer's software
data structure.

A software structure is carried in the packet buffer along
with the packet and the associated MAC status. This
structure is written with predicate analysis results, hash table
pointers to records found, hash insertion pointers in the case
of a failed search, checksum results, a pointer to the base of
each protocol found, extracted and computed fields, etc., for
use by the application(s) running on the PP.

In order to accelerate these functions, the Classification
Engine loads some or all of the packet from the PE
SDRAM-based memory (PE Memory) into a packet
memory (PMEM) which it can then access randomly or
sequentially to extract fields from the packet. A mask-and-
rotate unit allows arbitrary bit fields to be extracted from
words of the packet which can then be used as operands in
computation or as comparison values for bulk table
comparisons. Table comparisons or individual arithmetic and
logic operations can set one or more bits in the result vector
which is a large, 1-bit wide register file. These RESVEC bits
can then be accessed randomly and arbitrary boolean
operations can be done on pairs of bits to produce more RESVEC
bits, at a rate of up to two boolean bit operations per cycle,
eventually reducing sets of bits to single-bit predicate
results. Gang operations (GANGOPs) help optimize boolean
reduction by doing a logical operation (OR, AND, NOR,
or NAND) on any number of selected bits within a 32-bit
group of RESVEC bits in a single clock, producing a single
RESVEC bit as a result. After boolean reduction is
complete, some or all of the result vector can then be spilled
to the software structure in the packet buffer in PE Memory
for use by the Policy Processor.

A 32-bit Arithmetic and Logic Unit (ALU) and a set of
general-purpose 32-bit registers (GPREG) allow for general
computation as well.

Program flow control in the branch unit allows the
microcode to decide if the next instruction in the microcode
control store (CMEM) comes from a sequential location,
from a relative-branch value which can be an immediate
value in the microword or the contents of a GPREG, or (in
the case of a RETURN) from the top of the hardware
microstack; microstack values are enqueued when a CALL
style of branch is executed, and the microstack is accessed
in LIFO (last-in, first-out) fashion to support nested
subroutines in the microcode. Branch, Call, and Return operations
are all conditional based on any of the rich set of condition
codes provided. When the microcode bit "BRANCH_EN"
is set then a Branch, Call, or Return is executed if the
selected condition code is true; calls and returns are done if
the associated bit CALL or RET is set in the control word
when BRANCH_EN is set. Due to pipelining of the
microsequencer all program-flow changes have a 1-cycle
delay before taking effect, so the instruction following any
of the program flow control instructions (the "branch delay
slot") is always executed regardless of the success or failure
of the condition of the flow control instruction; as a result of
this the address stored in the microstack upon a successful
CALL is the address of the first instruction following the
delay slot.

The CE also contains several special purpose registers and
also supports execution of many special operations.
Special-purpose registers include the interface to PE memory, the
condition code register, a memory base pointer register used
for base-index access to packet buffers in PE memory, a
chip-wide timestamp timer, and instrumentation and
diagnostic registers including a counter which monitors
execution time and a counter which tracks stall cycles due to
various memory interface delays.

The memory interface appears to the microcode as 3
FIFOs: DFIFO_W receives one or more words of data to
be packed into a memory burst access for stores, DFIFO_R
unpacks requested bursts of data that have been read from
memory, and MEM_ADDR receives PE memory addresses
along with the size and direction information. Reads (or
"loads") are non-blocking; microcode schedules a load and
then can take the data from DFIFO_R at any time later; if
the data has not yet arrived then the pipeline will stall until
it does. The pipeline will also stall if there is an attempt to
write data to DFIFO_W and there is no room or if there is
an attempt to schedule another address in MEM_ADDR
and there is no room. Both of these conditions are
self-clearing as the FIFOs drain to the chip's memory controller.
Extensive error-checking logic uses counters to
track the state of various parts of the memory interface and
will not allow microcode to oversubscribe DFIFO_R nor to
issue a write ("store") to memory unless precisely the right
number of words of data have already been scheduled in
DFIFO_W. Memory access sizes are 1, 2, 4, or 8 32-bit
words.

Using the memory interface for a store consists of writing
the desired number of words of data to DFIFO_W, then
committing the store by scheduling the address into
MEM_ADDR along with the appropriate size code and the
direction flag for a store. Using it for a load consists of scheduling
the address, size, and direction flag for a load into
MEM_ADDR, then consuming precisely that many words in order
from DFIFO_R at some later time. DFIFO_R holds up to
4 maximum-sized bursts or up to 32 words of data scheduled
as smaller reads, so properly written microcode can often
hide the latency of reading PE Memory by scheduling
several loads before consuming the result of the first. Bulk
data movement such as filling PMEM with a packet can keep
several reads landing in pipelined fashion to move data
at the maximum memory bandwidth available.

These non-blocking loads help to accelerate hash table
searches and linked-list searches; once the header of a record
has been fetched, the forward pointer can be used to
speculatively fetch the next record before doing any key
comparisons with the current one, hiding much of the memory
latency and generally overlapping computation and memory
access so that hash searches can be done as fast as the
records can be fetched from the SDRAM (PE Memory).

Special Operations include various administrative
functions that the CE uses; these include functions such as
incrementing MCCONS and RCCONS in the RTU,
flash-clearing the general purpose registers and the result vector,
selecting immediate or index-register addressing for PMEM,
loading the PMEM index pointer and setting or clearing its
sequential access mode, managing a sequential index
counter for RESVEC used for table comparisons and result
spills, halting the sequencer or putting it into a power-saving
sleep mode, managing certain special condition codes, etc.
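
The split-add operation mentioned above can be illustrated in Python. `split_add` follows the description in the text (upper half plus lower half of a 32-bit operand, giving a result of at most 17 bits). The surrounding `internet_checksum` routine is the standard ones-complement Internet checksum and is an assumption about how such an adder would be used; it is not the CE's actual microcode sequence.

```python
def split_add(operand: int) -> int:
    """Add the upper and lower 16-bit halves of a 32-bit operand.
    The result always fits in 17 bits."""
    operand &= 0xFFFFFFFF
    return (operand >> 16) + (operand & 0xFFFF)


def internet_checksum(data: bytes) -> int:
    """Ones-complement checksum over 16-bit big-endian words, folding the
    carry back in with split_add after every addition."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = split_add(total)  # fold any carry out of bit 15
    return (~total) & 0xFFFF
```

Running `internet_checksum` over an IPv4 header whose checksum field is zeroed yields the value to store in that field; running it over a header with a correct checksum yields zero.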
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Bulk Table Comparisons (using the cmprn instruction)
implement the CE's only multi-cycle instruction; prior to
executing cmprn, one or two 32-bit comparison values are
loaded into general purpose registers. In the first cycle of a
cmprn instruction one or two general-purpose registers are
identified as the A-side and B-side comparison values (both
can be the same register if desired), a starting index into
RESVEC is set, four special condition codes associated with
bulk table comparisons are cleared, an instruction-length
counter is initialized to the instruction length "N", and the
entire processor is set for cmprn mode. The next "N" 64-bit
microcode words are interpreted as pairs of 32-bit values for
comparison rather than as microcode; one 32-bit value is
compared to the A-side register and the other is compared to
the B-side register, and if either matches, the associated bit
in the (even, odd) bit pair pointed to by the
RESVEC_INDEX is set; then the RESVEC_INDEX is incremented to
point at the next bit pair, the length counter is decremented,
and the next comparison value pair is fetched from CMEM.
The process is repeated until the length counter reaches 0.

Associated with this process are the four condition-code
bits MATCH_A, MATCH_B, MATCH_A_OR_B, and
MATCH_A_AND_B, which indicate that at least one table
value matched on the A-side, on the B-side, on either side,
or on both the A-side and B-side together (as a 64-bit
match), respectively.

Given this facility it is possible to compare one extracted 
value to (2*N) constants or to compare two values to N 
constants each, in a total of (N+1) cycles. These bulk table
lookups are useful for rapidly searching small tables as part 
of predicate analysis; hash-table lookups are used for larger 
tables when it becomes more time-efficient to do so. 

Another special condition code is "Sticky-zero" or "SZ".
It is used to cumulatively check status on a chain of equality
comparisons of the form "if (A=X) and (B=Y) and (C=Z)
and (D=W) then ..." by first setting the SZ bit in the
Condition Code Register using a special operation, then
doing a series of equality comparisons or other arithmetic
functions, then doing a conditional test of SZ; the bit stays
set as long as the results of all intervening operations that
set condition codes have the "data equals zero" status. Any
"data not equal to zero" status result in the series will cause
SZ to clear and stay clear.
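
The sticky-zero behavior described above can be sketched in
software as follows. This is only an illustrative model, not
the hardware implementation: the class and function names are
hypothetical, and modeling each equality comparison as a
32-bit XOR whose zero result means "equal" is an assumption
made for the example.

```python
class ConditionCodes:
    """Toy model of the sticky-zero (SZ) condition code only."""

    def __init__(self):
        self.sz = False

    def set_sz(self):
        # Models the special operation that sets SZ before a comparison chain.
        self.sz = True

    def record_result(self, value):
        # A non-zero result clears SZ, and it stays clear until set_sz() is
        # called again; a zero result leaves SZ unchanged.
        if (value & 0xFFFFFFFF) != 0:
            self.sz = False


def all_equal(pairs):
    """Return True if every (extracted, expected) pair compares equal."""
    cc = ConditionCodes()
    cc.set_sz()
    for extracted, expected in pairs:
        # Assumed comparison: XOR is zero exactly when the operands match.
        cc.record_result(extracted ^ expected)
    return cc.sz
```

A single final test of SZ then replaces a conditional branch
after every individual comparison in the chain.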

A messaging facility between the CE and the PP is 
provided; the CE can set any of 4 status bits which cause 
status to become visible to the PP (Message-Out bits) and 
the PP can set any of the 4 status bits (Message-in bits) 
which the CE can test as condition codes. These bits can be 
used for any messaging purpose as assigned by software. 

Two other condition code bits are "RX_RING_RDY"
and "RECLASS_RING_RDY", which are used by the
RTU to indicate to the CE that there is at least one buffer
pointer for it to process in the two buffer pointer rings on
which it is a consumer; one ring is the "RX Ring" and always
carries packets from the associated RX MAC to this CE, and
the other is called the "Reclassification Ring" through which
any party can schedule a packet to be processed on this CE.

In summary, the Classification Engine tests the two ring
status bits and the 4 message bits in a dispatch loop, and calls
the appropriate service routine when a condition is found to
be active. (When no conditions are active the dispatch loop
sets the CE into "sleep mode" to reduce power
consumption.) The ring service routines fetch a packet buffer
pointer from the associated ring, fetch some or all of the
packet (only as much as the microcode will need to examine,
or all of the packet if checksums are to be validated on the
payload), then start with the first protocol header and
execute a series of application-specific operations to extract
fields from the packet, identify and process arbitrary
protocol headers, do table lookups via bulk comparisons or
hash table searches as directed by the application, do
checksum verifications as programmed, do boolean reduction
on interim results, extract and optionally compute on arbitrary
fields in the packet, and finally write all results to a data
structure in the per-packet result area that travels with the
packet in the packet buffer in SDRAM. The results written
include the set of single-bit predicate analysis results, hash
search results (a pointer to the record that matches the key
extracted from this packet or a pointer to where a hash
record should be inserted if one does not exist and the
application wants to create one, for any number of different
tables with different keys), plus any extracted or computed
values (such as index pointers to the start of each layer of
protocol header) desired by the application. Microcode can
be loaded into CMEM by the AP or PP, or by the CE itself
once it has been loaded with its initial microcode.
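
The dispatch loop summarized above might be sketched as
follows. This is a hedged illustration only: the text does
not give a polling order or names for the four message bits,
so the order in CONDITIONS and the MSG_IN_n names are
assumptions, and "SLEEP" merely stands in for entering sleep
mode.

```python
# Assumed polling order; MSG_IN_0..3 are hypothetical names for the
# four Message-in bits that the CE can test as condition codes.
CONDITIONS = ("RX_RING_RDY", "RECLASS_RING_RDY",
              "MSG_IN_0", "MSG_IN_1", "MSG_IN_2", "MSG_IN_3")


def dispatch_once(status, handlers):
    """Run one pass of the dispatch loop and return what was serviced.

    `status` maps a condition name to a bool; `handlers` maps the same
    names to zero-argument service routines.
    """
    serviced = []
    for name in CONDITIONS:
        if status.get(name, False):
            handlers[name]()          # call the matching service routine
            serviced.append(name)
    if not serviced:
        # No condition active: the CE would be put into sleep mode here.
        serviced.append("SLEEP")
    return serviced
```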

The following pages include a block diagram of the CE,
a table identifying the various microcode control bits,
formats for the microcode, and tables of relevant values.
1. CE Block Diagram
FIG. 13 shows a block diagram of the Classification
engine.

1.1 Overview of the Classification Engine in FIG. 13 

The Classification Engine is a pipelined microsequencer.
A 64-bit microword is fetched from Control Store CMEM
1202 using an address supplied by register PC 1234, and is
stored in the instruction register I-REG 1216. This cycle is
referred to as the Fetch cycle 1302.

The 64-bit microword in I-REG 1216 has 7 bits each
dedicated to enabling the retirement of a result by causing
registers to be loaded. One of these bits is reserved for future
enhancements, while 6 of them have specified functions as
described in Table 16. This group of signals is known as the
write enables WE[6:0]. The WE bits also have
function-specific names as shown in Table 16: BRANCH_EN,
REG_WE, CC_WE, RESVEC_WE, PMEM_WE, and
SPECOP_EN.

BRANCH_EN enables conditional program flow
changes if a condition test is met. It controls units in the
Address Generation Unit 1230.

REG_WE enables retirement of 32-bit results in the
word-oriented half of the machine to all of the
general-purpose registers and special registers listed in
Table 17. It also has side effects of incrementing the PMEM
1204 index counter PCNT 1222 or dequeuing a word of data
from DFIFO_R 1250 under certain circumstances.

CC_WE enables the writing of the arithmetic result bits
in the condition code register.

PMEM_WE enables writes into packet memory PMEM
1204.

RESVEC_WE enables stores in the bit-oriented result
vector RESVEC 1208.

SPECOP_EN enables special operations including writ- 
ing to PCNT 1222, NCNT 1224, BDST_CNT 1226, and 
other functions listed in Table 22. 
The pipeline is 3 stages deep as shown in FIG. 14. The
Fetch stage 1302 has been described above. The Decode
stage 1304 takes place from the output of I-REG 1216 to the
inputs of D-REG 1212, PC 1234 and RESVEC 1208. The
Execute stage 1306 takes place from the output of D-REG
1212 to the inputs of all general purpose registers and special
purpose registers listed in Table 17; ALUOUT can be written
to GPREG 1206, MEM_ADDR 1254, DFIFO_W 1252,
the CTRL_FILL registers 1210, and the special registers in
block 1270. FIG. 14 shows in detail what occurs in each
stage of the pipeline, and at what stage various types of
results are retired. Pipeline stall conditions suppress all of
the WE bits so that the same condition holds from one cycle
to the next, until the stall condition clears. Since this stall
condition affects all microcode-controlled changes of state
in the CE, it is implicit in all subsequent discussion of
operation of the pipeline and the effect of stalls needs no
further discussion. The causes of pipeline stalls are
described in subsequent sections.
1.2 Program Flow Control 

The address generation unit 1230 determines what
address will be used to fetch the next microword from
CMEM. The Program Counter (PC) 1234 contains the
address of the current instruction being fetched. If
BRANCH_EN is a '0' then the next value of PC is an
increment of the current value; with no branches the
microsequencer fetches microwords sequentially from
CMEM. When BRANCH_EN is asserted a test of condition
codes listed in Table 21 is done as selected by bits
CCSEL[4:0] and optionally inverted by FALSE, both fields
described in Table 16. If the condition test returns a '1' then
the conditional branch will be taken, otherwise PC 1234 will
be loaded with the increment of its current value. The bit
REG is tested; if it is '0' then the address PC is added to the
value of the bits BRANCH_ADDR[9:0] to generate the branch
value of PC; if it is '1' then the address PC is added to the
value on bus REGB[9:0] to generate the branch value. The
bus REGB carries the output of GPREG 1206 port DO1,
which carries the value of the general purpose register
selected with bits RSRCB[2:0].

Next bit RET is tested. If it is a '1' the PC is loaded with
the output of the microstack 1232, and the microstack's
stack pointer is decremented by 1. The microstack 1232 is a
Last-in, First-out (LIFO) structure used to support
micro-subroutines, nested up to 8 deep. If RET was a '0' then
PC is loaded with the calculated branch value described above
instead, and CALL is examined. If CALL is a '1' then the
microstack 1232 has its stack pointer incremented, and the
incremented value of the previous PC is written into the
microstack using the new value of the stack pointer. In this
way the address stored in the microstack 1232 when a CALL
is executed is the address of the next instruction that would
have been executed sequentially if the branch had not
succeeded; thus when calling a subroutine it is the address
of the next instruction to return to after executing a RET to
terminate the subroutine.

Since all program flow control decisions are made in the
Decode stage 1304, the sequential instruction which follows
is already in the fetch stage and is always executed. This
means that there is always a 1-cycle delay between fetching
a successful BRANCH_EN instruction and its effect on PC.
The instruction which follows a branch instruction, and is
always executed regardless of the success or failure of the
branch, is called a delay-slot instruction. A delay-slot
instruction may not have BRANCH_EN set. The return
value stored in the microstack 1232 after a successful CALL
is the address of the instruction following the delay slot
instruction of the CALL.

The microstack 1232 in the preferred embodiment of the
invention consists of 8 registers with a multiplexer (mux)
selecting one of them as the microstack output. A single 3-bit
counter is used as the stack pointer; it is decoded in such a
way that the read address is N and the write address is (N+1),
so that a read-and-decrement or write-and-increment can be
executed in a single cycle. Attempting to execute a CALL
when the microstack already has 8 valid entries in it, or
attempting to execute a RET when the microstack has no
valid entries in it, causes the pipeline to halt and signal
STACK_ERROR status to the Policy Processor 244.
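
The program flow rules of this section can be summarized in a
small software model. It is a sketch under stated
assumptions, not the hardware logic: the names are
hypothetical, `pc` is taken to be the address already being
fetched when the flow-control instruction is in Decode (that
is, the delay-slot address), the branch offset is treated as
unsigned, and addresses are assumed to wrap at 10 bits.

```python
class StackError(Exception):
    """Stands in for the STACK_ERROR status signaled on misuse."""


class Microstack:
    DEPTH = 8  # micro-subroutines nest up to 8 deep

    def __init__(self):
        self.entries = []

    def push(self, address):
        if len(self.entries) >= self.DEPTH:
            raise StackError("CALL with 8 valid entries")
        self.entries.append(address)

    def pop(self):
        if not self.entries:
            raise StackError("RET with no valid entries")
        return self.entries.pop()


def resolve_flow(pc, stack, branch_en, cond_true,
                 call=False, ret=False, offset=0):
    """Return the PC value loaded after the delay-slot instruction at `pc`."""
    sequential = (pc + 1) & 0x3FF          # assumed 10-bit wrap
    if not (branch_en and cond_true):
        return sequential                  # no flow change
    if ret:                                # RET is tested first
        return stack.pop()
    target = (pc + offset) & 0x3FF         # PC plus BRANCH_ADDR or REGB
    if call:
        # Saved address is the instruction after the delay slot.
        stack.push(sequential)
    return target
```

For example, a CALL located at address 10 has its delay slot
at address 11; with an offset of 20 the model branches to 31
and saves 12, which a later RET restores.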

CCSEL, FALSE, BRANCH_ADDR, RSRCB, REG, 
CALL and RET are all defined in Table 16. 

1.3 32-bit operations 

The Classification Engine has two distinct data domains; 
one is oriented around 32-bit data, and the other is oriented
around 1-bit boolean data in RESVEC 1208 and the Bit ALU 
1260. There are a few places where data is communicated 
between these two domains. This section describes the 
32-bit domain. 

The 32-bit domain centers around selecting the A-side and 
B-side operands which are then fed into AIN and BIN of the 
ALU 1214. The output ALUOUT from ALU 1214 is then 
written back to one of the 32-bit destinations, and optionally 
the arithmetic condition codes are set if CC_WE is '1'. The
ALU 1214 is a 32-bit Arithmetic and Logic Unit which 
performs any of the arithmetic functions listed in Table 19 
or any of the logic functions listed in Table 20 under control 
of the bits ALUOP[5:0] defined in Table 16. 

GPREG 1206 is a 32-bit general-purpose register file
comprising 8 32-bit registers. It has two read ports and one
write port. Read port DO0 has the contents of the register
selected by RSRCA[2:0], and read port DO1 has the
contents of the register selected by RSRCB[2:0]. The register
selected by RDST[2:0] is written to with the value of
ALU_OUT if RDST[3] is '0' and REG_WE is '1'. In order
to make newly-generated register values available in the
subsequent instruction, the pipeline delay of writing into
GPREG and reading out the new value is squashed through
use of Bypass Multiplexers 1221 and 1223, which are used
to forward ALU_OUT to busses REGA and REGB if RDST
of the instruction in the execute stage matches RSRCA or
RSRCB, respectively, in the instruction in the decode stage,
thus hiding the pipeline delay. The A-side operand is
selected among the A-side sources listed in Table 17 by
multiplexer 1225. The selected data is then sent into the
split-add-mask-and-rotate unit 1240. Bits [31:16] of the
data are added to bits [15:0] of the data in the adder 1248, and
the 17-bit result is concatenated with zeros in bits [31:17] to
create the split-add result. The selected data is also sent to
the Mask Unit 1242 where it is bitwise ANDed with
MASK[31:0] if MSK[1] is a '1', or is passed through
unmodified if MSK[1] is a '0'; the result from MASK 1242
is sent through the ROTATE barrel-shifter 1244 where the
data is rotated right by the number of bits specified in
ROT[4:0] in the microword. Finally, MSK[0] is used to
select between the split-add result and the mask-rotate result
in multiplexer 1246, and the result is presented to D-REG
1212 as the A-side operand for the execute stage 1306. The
B-side operand is selected among the B-side sources listed
in Table 18 using multiplexer 1228, and is presented to the
D-REG 1212 as the B-side operand for the execute stage
1306.
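
The A-side operand preparation described above can be
illustrated with the following sketch. The function names are
hypothetical, and because the text does not state which value
of MSK[0] selects which result, the selection is exposed here
as a plain boolean rather than as a bit value.

```python
MASK32 = 0xFFFFFFFF


def split_add(value):
    """Add bits [31:16] to bits [15:0]; the 17-bit sum is zero-extended."""
    value &= MASK32
    return ((value >> 16) + (value & 0xFFFF)) & 0x1FFFF


def mask_rotate(value, mask, rot, mask_enable):
    """Optionally AND with MASK[31:0], then rotate right by ROT[4:0]."""
    value &= MASK32
    if mask_enable:                 # corresponds to MSK[1] being '1'
        value &= mask & MASK32
    rot &= 0x1F                     # ROT is a 5-bit field
    return ((value >> rot) | (value << (32 - rot))) & MASK32


def a_side_operand(value, mask, rot, mask_enable, use_split_add):
    """Select between the split-add and mask-rotate results."""
    if use_split_add:
        return split_add(value)
    return mask_rotate(value, mask, rot, mask_enable)
```

Masking 0x12345678 with 0xFF and rotating right by 4 yields
0x80000007 in this model; split-add of 0xFFFFFFFF yields the
17-bit value 0x1FFFE.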

RSRCA, RSRCB, ALUOP[5:0], RDST[3:0], MASK[31:0],
MSK[1], MSK[0], ROT[4:0] are all described in
Table 16.

1.4 PMEM 

Packet Memory (PMEM) 1204 is a (32-bit by 512-entry)
RAM with one read port and one write port used to hold some
or all of the packet being processed, and also to hold
arbitrary data generated by the program. PMEM 1204 can be
written from two sources: DFIFO_R 1250, or the REGA
bus from the general-purpose registers GPREG 1206, where
the register is selected by RSRCA[2:0]; such writes occur
when PMEM_WE is a '1' in the microword. PMEM is read
as one of the A-side sources selectable as one of the "special
register" sources.

PMEM 1204 addressing depends on the state bit USE_PCNT.
When USE_PCNT is '0' then PMEM 1204 is
addressed by PINDEX[10:2] from the microword. When
USE_PCNT is '1' then the address to PMEM 1204 is
provided by the counter/register PCNT 1222. USE_PCNT
is set and cleared via special operations. When SPECOP_EN
is '1' and LD_PCNT is '1', then PCNT_REG is
examined. If it is a "1" then PCNT is loaded with the value
of bits [10:2] of the general-purpose register in GPREG
1206 selected by RSRCB[2:0]; alternatively if PCNT_REG
is a "0" then PCNT is loaded with the value of
PINDEX[10:2] in the microword. In either case the state bit
USE_PCNT is set. Additionally, bit PCNT_INC is examined;
if it is a "1" then PCNT_INC_MODE is set, or if it is a "0"
then PCNT_INC_MODE is cleared. The state bit
PCNT_INC_MODE determines if PCNT 1222 holds a static value
during the PCNT_MODE period, or if it increments by one
each time PMEM is written to or is used as a register source.
USE_PCNT clears when an instruction has SPECOP_EN
equal to "1" and UNLOCK_PCNT also equal to "1".

DFIFO_R, RSRCA[3:0], RSRCB[3:0], PINDEX[10:2]
are all defined in Table 16. LD_PCNT, PCNT_REG,
PCNT_INC, UNLOCK_PCNT are all defined in Table 22.
1.5 Interface to Memory 260 

SDRAM Memory 260 can be read and written by the
microcode. The memory interface visible to the microcode
consists of the MEM_ADDR FIFO 1254, the write data
FIFO DFIFO_W 1252, and the read data FIFO DFIFO_R
1250. Writes to memory 260 are called stores, and reads
from memory 260 are called loads. Loads and stores can be
of size 1, 2, 4, or 8 words of 32 bits each. The address of a
memory access must be size-aligned for the specified burst;
that is, the address for a 2-word memory access must be on
an 8-byte boundary, the address of an 8-word access must be
on a 32-byte boundary, etc.
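
The size-alignment rule above amounts to a simple check. The
following sketch uses hypothetical names and assumes
addresses are expressed in bytes, as the boundary examples in
the text suggest.

```python
VALID_BURST_WORDS = (1, 2, 4, 8)   # burst sizes in 32-bit words
BYTES_PER_WORD = 4


def is_size_aligned(byte_address, words):
    """Return True if a burst of `words` words may start at `byte_address`."""
    if words not in VALID_BURST_WORDS:
        raise ValueError("burst size must be 1, 2, 4, or 8 words")
    return byte_address % (words * BYTES_PER_WORD) == 0
```

A 2-word access is accepted at byte address 8 but not at 4,
and an 8-word access is accepted at 32 but not at 16.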

To schedule a store, precisely the number of words for the
specified size of transfer are written to the special register
destination DFIFO_W 1252, then the address (along with
control information MEM_SIZE[1:0] and MEM_DIR=STORE)
is written into the address FIFO MEM_ADDR
1254, which triggers the memory interface to issue the store.
The microsequencer is decoupled from the memory system
by the FIFOs 1252 and 1254, and thus can continue
operation while the memory interface processes the store
operation. The FIFOs 1254 and 1252 can hold up to 8 addresses
and 16 words of data, respectively, so that in general more
than one store operation can be outstanding without stalling
the pipeline. The entire pipeline stalls when the execute
stage 1306 operation is a write to either MEM_ADDR 1254
or to DFIFO_W 1252 and the target FIFO does not have
room for another word. The situation will clear as the FIFO
drains its current operation to memory 260 so the stall
condition is transient.

To schedule a load, the address (along with control
information MEM_SIZE[1:0] and MEM_DIR=LOAD) is
written to special register destination MEM_ADDR, and
some time later the microcode can obtain the requested data
from the read data FIFO DFIFO_R 1250. Between the time
that the microsequencer scheduled the load operation and
the time the data is consumed, there is latency to access the
memory system 260. The microcode can choose to execute
any number of instructions between the time the load is
scheduled in MEM_ADDR 1254 and the data is consumed
from DFIFO_R 1250, since the loads are non-blocking.
However, if the microcode attempts to read data from
DFIFO_R 1250 and there is no data available, the pipeline
will stall until such time as requested data has returned from
memory 260. More than one load can be scheduled before
any data is consumed; DFIFO_R 1250 has room for up to
16 doublewords (128 bytes) of data.

The microcode is responsible for ensuring that it never
attempts to read data from DFIFO_R 1250 when no more
words of read data have been scheduled, nor to issue a store
address to MEM_ADDR 1254 when DFIFO_W 1252 has
not been written with precisely the number of words
specified in the size of the store. The microcode is also
responsible for never oversubscribing DFIFO_R 1250, that is,
scheduling more outstanding words of read data than
DFIFO_R 1250 has room for. Any of these conditions is
detected by error-checking logic in the CE which will halt
the CE and report violations to the Policy Processor 244 if
the memory system is used incorrectly.
1.6 Bit-oriented operations 

RESVEC 1208 is a 1-bit by 512-entry register file with 
special characteristics. It has one write port and 3 read ports; 
this means that in any one instruction 3 bits can be read and 
one write can be issued. The write can be to one bit, or to an 
adjacent pair of bits whose address differs only in the least 
significant bit, referred to here as an even-odd bit pair. For 
certain operations RESVEC 1208 can also be accessed as a 
32-bit by 16-entry register file.

When RESVEC_WE is a '1' and the microcode bit 2BIT
is a '0' then a single bit in RESVEC 1208 is written with the
data presented on the DIN0 data input port; that data is
selected from among 4 different sources under control of the
RES0_SEL[1:0] bits in the microword. Alternatively if
2BIT is a '1' then the DIN0 data is written to the
even-numbered bit in the destination, and DIN1 selected from
among two sources by RES1_SEL is written to the
odd-numbered bit of the pair.

The destination address in RESVEC 1208 comes either
from RES_BIT_DST[9:0] if state bit USE_WCNT is '0',
or from BDST_CNT 1226 if USE_WCNT is a '1'.
USE_WCNT is set when SPECOP_EN is '1' and
LD_BDST_CNT is a '1'. In that case BDST_CNT 1226 is
written with the value RES_BIT_DST[9:1]. At the same time
BDST_CNT 1226 is loaded, the bit BDST_CNT_MODE in the
microword is examined. If it is '0' then BDST_CNT 1226
is set to increment by 2; if it is '1' then BDST_CNT 1226
is configured to increment by 32. The former is used in the
special instruction CMPRN to sweep across sequential bit
pairs in each cycle of the instruction and to write to them,
while the latter is used for the RESVEC 1208 read address
port RA0 to sequentially read 32-bit groups of RESVEC
1208 bits as the B-side special register RES_VEC.

The bit-oriented ALU 1260 contains two boolean logic
units 1264 and 1268 and one gang operation unit 1262.
Boolean logic unit 1264 takes the two bits selected by
RES_BIT_SRC_A[9:0] and RES_BIT_SRC_B[9:0]
and applies the boolean operation BITOPAB[3:0] as
specified in Table 20. The 1-bit result RES_BIT0 is one of the
potential sources for write data port DIN0 on RESVEC
1208. Boolean logic unit 1268 similarly takes the operands
selected by RES_BIT_SRC_A[9:0] and
RES_BIT_SRC_C[9:0] and applies BITOPAC[3:0] in a
substantially similar manner, generating the 1-bit result
RES_BIT1 which may be selected as the DIN1 write data
source if 2BIT is '1'. Thus in one cycle up to two bitwise
boolean operations can be executed if the two operations have
one common operand. The GANG OP unit 1262 takes the 32
adjacent bits from RESVEC 1208 selected by
RES_BIT_SRC_A[9:5] and treats them as a word operand.
MASK[31:0] is used to select which bits of that word will
contribute to the gang results, then an AND, OR, NAND, or NOR
operation is performed on all of the selected bits as
instructed in GANGOP[1:0], and the result bit RES_GANG
is presented as one of the possible sources for DIN0 on
RESVEC 1208.
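
The gang operation can be sketched as a reduction over the
mask-selected bits of a 32-bit group. The function below is
illustrative only: the operation is passed by name because
the GANGOP[1:0] encoding is not given here, and the result
for an all-zero mask simply follows Python's all() and any()
conventions, which may not match the hardware.

```python
def gang_op(word, mask, op):
    """Reduce the bits of `word` selected by '1' bits in `mask` to one bit."""
    selected = [(word >> i) & 1 for i in range(32) if (mask >> i) & 1]
    if op == "AND":
        result = all(selected)
    elif op == "OR":
        result = any(selected)
    elif op == "NAND":
        result = not all(selected)
    elif op == "NOR":
        result = not any(selected)
    else:
        raise ValueError("op must be AND, OR, NAND, or NOR")
    return int(result)
```

With word 0b1011, a mask of 0b0011 gives an AND result of 1,
while widening the mask to 0b0111 brings in a zero bit and
gives 0.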

The condition code selected by CCSEL[4:0] and
optionally inverted with FALSE can also be selected as the data
source for port DIN0.

The remaining sources for DIN0 and DIN1 on RESVEC
1208 are the CMPR_A, CMPR_B result bits from one
cycle of a bulk comparison instruction CMPRN, described
below.

RESVEC 1208 address fields for sources and destination
are specified as 10 bits, even though only 9 bits are used in
the preferred embodiment; the extra bit allows for a
doubling of the size of RESVEC 1208 in future generations of
the device.

Writes to RESVEC 1208 are retired at the end of the
Decode stage 1304 and can thus be used immediately as an
operand in the subsequent instruction, without need for
bypassing as is done with GPREG 1206.

2BIT, RES0_SEL[1:0], RES1_SEL, BITOPAB,
BITOPAC, GANGOP[1:0], RES_BIT_DST[9:0],
RES_BIT_SRC_A[9:0], RES_BIT_SRC_B[9:0],
RES_BIT_SRC_C[9:0], MASK[31:0], CCSEL[4:0], FALSE are all
defined in Table 16.

LD_BDST_CNT, BDST_CNT_MODE are specified 
in Table 22. 

1.7 Bulk comparisons

When SPECOP_EN is '1' and LD_NCNT is also '1', the
instruction cycle counter N_CNT 1224 is loaded with the
value NCNT[6:0] (bits [22:16] of the microword) and the
state bit CMPRN is set. LD_BDST_CNT is required to
also be a '1' for this instruction, and BDST_CNT_MODE
must be a '0'. BDST_CNT 1226 is loaded with the value
RES_BIT_DST[9:1]. GPREG 1206 is locked with the
A-side select RSRCA[2:0] and the B-side select
RSRCB[2:0]. The bit CLEAR_HIT is required to be a '1' also in
this instruction, which has the effect of setting the condition
code register bits MTCH_A, MTCH_B, MTCH_AORB,
MTCH_AANDB all to zero.

For the next N cycles, until N_CNT 1224 has
decremented to zero, interpretation of the 64-bit microword is
suppressed and all 64 bits are treated as data instead. In each
of these cycles the microword bits [63:32] are compared to
the selected A-side register value REGA using comparator
1220 to produce the result CMPR_A if they are equal; and
microword bits [31:0] are compared to the selected B-side
register value REGB using comparator 1227 to produce
result CMPR_B if they are equal. During CMPRN the
RESVEC unit 1208 is locked into a mode where 2BIT is true
and RES0_SEL and RES1_SEL select CMPR_A,
CMPR_B respectively. The results CMPR_A and
CMPR_B are stored to the even-odd pair of bits in
RESVEC 1208 selected by BDST_CNT 1226, then
BDST_CNT 1226 is incremented, NCNT 1224 is decremented, and
the process repeats until NCNT 1224 equals zero. At that
point the state bits USE_BDST_CNT and CMPRN clear
and the pipeline goes back to normal operation where every
microword is interpreted.

During every comparison cycle of the CMPRN
instruction, if CMPR_A is a '1' then the condition code bit
MTCH_A will set and will stay set. Similarly if CMPR_B
is a '1' during any of those cycles then bit MTCH_B will
set and will stay set. If either CMPR_A or CMPR_B is true
during any of these cycles then condition code bit
MTCH_AORB will set and will stay set. Finally, if CMPR_A and
CMPR_B are both '1' during a CMPRN compare cycle,
then MTCH_AANDB will set and will stay set to indicate
that a 64-bit match was encountered.

By loading one or two registers in GPREG 1206 with 
comparison values prior to executing the CMPRN 
instruction, a single value can be compared to (2*N) values 
in a table, or two different values can each be compared to 
(N) values, in ((2*N)+1) execution cycles. 
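
The data path of the CMPRN instruction can be modeled in
software as shown below. This is a behavioral sketch with
hypothetical names, not a cycle-accurate description: RESVEC
is represented as a Python list of bits, each table entry as
one 64-bit integer, and the starting index is forced even on
the assumption that results always land on an even-odd pair.

```python
WORD_MASK = 0xFFFFFFFF


def cmprn(reg_a, reg_b, table_words, resvec, start_index):
    """Compare two registers against a table of 64-bit words.

    Bits [63:32] of each word are compared to `reg_a` and bits [31:0]
    to `reg_b`. Results are written into `resvec` as (even, odd) bit
    pairs, and the four sticky match flags are returned.
    """
    flags = {"MTCH_A": False, "MTCH_B": False,
             "MTCH_AORB": False, "MTCH_AANDB": False}
    index = start_index & ~1                 # assumed even-odd alignment
    for word in table_words:
        cmpr_a = ((word >> 32) & WORD_MASK) == (reg_a & WORD_MASK)
        cmpr_b = (word & WORD_MASK) == (reg_b & WORD_MASK)
        resvec[index] = int(cmpr_a)          # even bit: A-side result
        resvec[index + 1] = int(cmpr_b)      # odd bit: B-side result
        # Flags only ever set during the sweep; they never clear.
        flags["MTCH_A"] |= cmpr_a
        flags["MTCH_B"] |= cmpr_b
        flags["MTCH_AORB"] |= cmpr_a or cmpr_b
        flags["MTCH_AANDB"] |= cmpr_a and cmpr_b
        index += 2                           # advance to the next bit pair
    return flags
```

Passing the same value as both `reg_a` and `reg_b` searches
one key against both halves of every table word, which is the
single-value case described in the text.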

RES_BIT_DST[9:0], RSRCA[3:0], RSRCB[3:0], 2BIT,
RES0_SEL, RES1_SEL are specified in Table 16.

LD_NCNT, LD_BDST_CNT, CLEAR_HIT are specified
in Table 22.

1.8 Special Operations 

In addition to the special operations mentioned so far, 
there are other administrative functions which are enabled 
with SPECOP_EN and decoded from the bits specified in 
Table 22. Decode of these functions and any decode neces- 
sary for implementing the instruction set specified take place 
in the decoder block DCD 1272. 

1.9 CMEM Fills 

The microstore CMEM 1202 is filled either via a series of 
PIO write accesses from the Policy Processor 244 or Appli- 
cation Processor 302, or can be loaded by use of the 
CTRL_FILL unit 1210. The registers in CTRL_FILL 1210 
are loaded with an address in memory 260, an address in 
CMEM 1202, and a count of the number of instructions to 
be loaded. With the CE pipeline halted, the CTRL_FILL 
unit will execute this transfer. 

The transfer may be initiated by the Policy Processor 244,
the Application Processor 302, or can be initiated by
microcode running on the CE, in which case the CTRL_FILL
1210 registers appear as special register destinations as
shown in Table 17, and the operation is triggered with an
instruction which has SPECOP_EN equal to '1' and HALT
and DO_CMEM_FILL asserted. After the transfer
completes, microcode can then continue execution,
including the newly downloaded code. The CE can only load and
launch itself if microcode to do so is already resident in
CMEM 1202 and if the host has configured the CE to allow
it to do so.

HALT and DO_CMEM_FILL are specified in Table 22. 
2. CE Programming Languages 

CE programs can be written directly in binary; however 
for programmer convenience a microassembly language 
uasm has been developed which allows a microword to be 
constructed by declaring fields and their values in a sym- 
bolic form. The set of common microwords for the intended 
use of the CE have also been described in a higher-level CE 
Assembly Language called masm which allows the pro- 
grammer to describe operations in a register-transfer format 
and to describe concurrent operations without having to 
worry about the details of microcode control of the under- 
lying hardware. Both of these languages can be used by a 
programmer or can be generated automatically from a 
compiler which translates CE programs from a higher-level 
language such as NetBoost Classification Language (NCL). 

V Microprogramming Guide 
The 64-bit CE instruction word is raw microcode; some 
bits enable retirement of operations by writing to one or 
more units, and the rest are used to steer different data paths 
and to provide control codes to various units in parallel. 
Depending on which results are retired, the fields in the 
microword have different meaning. There are 7 different 
ways that the microword is interpreted; even though all 
steering is really done in parallel, these 7 instruction formats 
show which sets of fields can be used without conflict.

There are 7 bits that are constant in all formats; these are
the bits that enable stores into various units. These bits are
{REG_WE, RESVEC_WE, CC_WE, reserved, PMEM_WE,
BRANCH_EN, and SPECOP_EN}, which are
assigned in that order to bits [63:57] of the microword and
are described in Table 16. The remaining bits are assigned to
control points as shown in FIG. 13 and are defined in the
following sections.
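
The write-enable bits can be pulled out of a 64-bit microword
as sketched below. The function name is hypothetical, and the
sketch assumes that "in that order" means the first listed
name (REG_WE) occupies bit 63 and the last (SPECOP_EN)
occupies bit 57.

```python
# Listed in the order given in the text; index 0 is assumed to be bit 63.
WRITE_ENABLE_NAMES = ("REG_WE", "RESVEC_WE", "CC_WE", "reserved",
                      "PMEM_WE", "BRANCH_EN", "SPECOP_EN")


def decode_write_enables(microword):
    """Return a dict of write-enable name to bool for a 64-bit microword."""
    return {name: bool((microword >> (63 - i)) & 1)
            for i, name in enumerate(WRITE_ENABLE_NAMES)}
```

Under that assumption a microword with only bit 58 set
decodes as BRANCH_EN alone.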

As shown in FIG. 14, the CE is implemented as a 3-stage 
pipeline; each instruction passes through the three stages 
Fetch 1302, Decode 1304, and Execute 1306; at any time 
there are three different instructions being processed. The 
figure shows what processes occur in each stage of the 
pipeline, and helps illustrate behavior of the pipeline shown 
in FIG. 13. When the pipeline stalls all three stages stall 
together in lockstep. 

Most word-oriented operations pass one operand through 
either the mask/shift unit or the split-add unit and then all 
work-oriented operations pass through the Execute-stage 
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there is no delay between creation of a result and use of it 
in a subsequent operation. Sirnilarly, -use of condition codes 
for BRANCH (conditional flow control) or BSET (setting a 
selected RESVEC bit to the result of a condition code test), 
or reads of CC_REG (Condition Code Register) when the 
bits are being updated requires bypassing. 

Other registers (e.g. BASE_REG) do not have forward- 
ing so the software must delay one clock after writing them 
before using the result. 

1. Microword Format Definitions 
1.1 MOV, ALU, and LDST operations 
REG_WE is set. 

These instructions select 1 or 2 sources among GPREG 
and SPREG, do a mask/shift or split-add of the A-side 
operand, then pass them through the ALU and store the 
result to an SPREG or GPREG. Condition codes Z, N, V, SZ, 
and CY are optionally set by this operation if CC_WE is set. 



TABLE 9 



MOV and ALU formats 



6 
3 


5 

7 


5 5 

6 3 


5 5 
2 1 


5 4 

0 6 


4 4 

5 0 


3 3 
9 6 


3 
5 


3 
2 


WRITE_ENABLES [6 :0] 


RDST[3:0] 


MSK 
[1:0) 


ROT[4:0] 


ALUOP[5:0] 


RSRCB[3:0] 


RSRCA[3:0] 


MASKor IMMED 


3 
1 




0 
0 



TABLE 10 



MOV and ALU formats with PMEM sre 



6 
3 


5 
7 


5 5 

6 3 


5 5 
2 1 


5 4 
0 6 


4 
5 


4 3 
4 6 


3 

5 , 


3 
2 


WRITE_ENABLES[6:0] 


RDST[3:0] 


MSK 
[1:0] 


ROT[4:0] 


1 


PINDEX[10:2] 


RSRCA[3:0] 


MASK or IMMED 


3 
1 




0 
0 



50 

ALU before being retired. Any consumer of a newly- Note that with PMEM [immediate_index] as a source the 

produced GPREG value actually receives a forwarded copy ALU is bypassed (except for sign and zero-detect); however, 

of the current ALU output via some bypass logic so that mask/rotate or split-add are still available. 
TABLE 11

LDST format

Bits     Field
63:57    WRITE_ENABLES[6:0]
56:53    RDST[3:0]
52:51    MSK[1:0]
50:46    ROT[4:0]
45:40    ALUOP[5:0]
39:36    RSRCB[3:0]
35:32    RSRCA[3:0]
31:30    (a)
29       (b)
27:0     IMMED_ADDR[27:0]

(a) SIZE[1:0]

(b) DIR



1.2 BIT_OP 

Bitops and gangops have RESVEC_WE set. These
instructions select a bit RES_BIT_DST in RESVEC as a
destination to which the RES0 result is written; and if
(optionally) 2BIT is set, then RES_BIT_DST is treated as
the pointer to an adjacent pair of bits where the first has an
even address and the second has the next (odd) address. With
2BIT the odd bit is written with the RES1 result.

Depending on the value of the field RES0_SEL, the
RES0 result may come from a boolean operation BITOPAB
performed on the operands selected by RES_BIT_SRC_A
and RES_BIT_SRC_B, or the result of a GANG operation
performed on bits in the group of 32 RESVEC bits selected
by RES_BIT_SRC_A[9:5] and further selected by the "1"
bits in the 32-bit immediate MASK field, or the selected and
optionally inverted condition code bit selected by CCSEL
and FALSE, or the A-side result of a bulk table comparison
CMPR_A.

If RES1 is being written to the odd bit of a pair, the RES1
result is selected by RES1_SEL to be either the result of the
arbitrary boolean operation BITOPAC performed on the
operands selected by RES_BIT_SRC_A and
RES_BIT_SRC_C, or the B-side result of a bulk table comparison
CMPR_B.



TABLE 12

BIT_OP Format

Bits     Field
63:57    WRITE_ENABLES[6:0]
56:47    RES_BIT_DST[9:0]
46:45    (a)
44       (b)
43       (c)
41:32    RES_BIT_SRC_A[9:0]
31:22    RES_BIT_SRC_B[9:0]
21:12    RES_BIT_SRC_C[9:0]
7:4      BITOPAB[3:0]
3:0      BITOP_AC[3:0]

Alternate use of bits 27:22:

27       (d)
26:22    CCSEL[4:0]

(a) RES0_SEL[1:0]

(b) 2BIT

(c) RES1_SEL

(d) FALSE (selects gender of CCMUX output; 0 = as is, 1 = inverted)
TABLE 13

GANG_OP Format

Bits     Field
63:57    WRITE_ENABLES[6:0]
56:47    RES_BIT_DST[9:0]
46       1
45       0
44       0
43:42    (a)
41:37    RES_BIT_SRC_A[9:5]
31:0     MASK

(a) GANG_OP[1:0]
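The gang operation can be sketched in a few lines. This is a hedged illustration, not the hardware definition: the function name `gang_op` is hypothetical, and the reading that the AND form tests only the bits selected by MASK (so that an all-zero MASK yields 1) is an assumption. The AND/OR steering and the result inversion follow the GANG_OP[1] and GANG_OP[0] descriptions in Table 16.

```python
WORD_MASK = 0xFFFFFFFF


def gang_op(resvec_word, mask, gang_op_bits):
    """Reduce the MASK-selected bits of one 32-bit RESVEC word to one bit.

    resvec_word  -- the RESVEC word selected by RES_BIT_SRC_A[9:5]
    mask         -- the 32-bit immediate MASK field
    gang_op_bits -- GANG_OP[1:0]: bit 1 '1' = AND, '0' = OR; bit 0 inverts
    """
    mask &= WORD_MASK
    selected = resvec_word & mask
    if gang_op_bits & 0b10:
        # Assumption: AND is taken over the selected bits only.
        result = selected == mask
    else:
        result = selected != 0
    if gang_op_bits & 0b01:
        result = not result  # NAND or NOR
    return int(result)
```

With this reading, `gang_op(word, mask, 0b10)` answers "are all selected result bits set" and `gang_op(word, mask, 0b00)` answers "is any selected result bit set".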

1.4 Branch 

BRANCH_EN is always set in this format. Note that a 
register-to-register aluop can be folded into the same 
instruction as long as there are no other field conflicts. 



TABLE 14

Branch Format

Bits     Field
63:57    WRITE_ENABLES[6:0]
         RSRC[3:0]
31       (b)
30       (c)
29       (d)
27       (a)
26:22    CC_SEL[4:0]
21:12    RES_BIT_SRC_C[9:0]
9:0      BRANCH_ADDR[9:0]

(a) FALSE (selects gender of CCMUX output; 0 = as is, 1 = inverted)

(b) CALL

(c) RET

(d) REG (selects GPREG ('1') or immediate value ('0') for branch)



1.5 SPECOP 

Special Operation bits (which are all qualified with
SPECOP_EN) are defined in Table 22.
The instructions cmprn, setpcnt[i], and set_resvec_index
also use some specop fields.



TABLE 15

SPECOP Format

Bits     Field
63:57    WRITE_ENABLES[6:0]
56:47    RES_BIT_DST[9:0] or WCNT[9:0]
46:45    (a)
44       (b)
43       (c)
39:36    RSRCB[3:0]
35:32    RSRCA[3:0]
44:36    -or- PINDEX[10:2]
31:23    (*)
22:16    N_CNT[6:0]
15:12    MSG[3:0]
11:0     (*)

(a) RES0_SEL[1:0] (for CMPRN)

(b) 2BIT (for CMPRN)

(c) RES1_SEL (for CMPRN)

(*) The interpretation of these bits is defined in Table 22.
(?) Undefined but reserved for future special operations
1.6 Control Field Definitions

TABLE 16

Control Fields

Each entry gives the signal, its bit positions, and its function.

WE[6:0]  [63:57]
    These are the fixed-format signals which retire results (unless
    the pipeline is stalled); they are:
    [0] SPECOP_EN: Enables special ops as defined in 9.2.5.
    [1] BRANCH_EN: Enables a conditional program flow control operation
    [2] PMEM_WE: Enables stores into PMEM
    [3] reserved
    [4] CC_WE: Enables store to CC_Z, CC_CY, CC_SZ, CC_V, CC_N
    [5] RESVEC_WE: Enables stores to the result bit vector
    [6] REG_WE: Enables stores of ALU_OUT into the GPREG file if
        (RDST[3] == 0), or into SPREGs if (RDST[3] == 1).

RSRCA[3:0]  [35:32]
    Selects a GPREG to drive out on DOUT0 (using [2:0]) and selects
    between GPREG and SPREG sources on the mux to SPLIT-ADD and MASK
    using [3]

RSRCB[3:0]  [39:36]
    Selects a GPREG to drive out on DOUT1 (using [2:0]) and selects
    between that and SPREG sources on the ALUB input mux

RDST[3:0]  [56:53]
    Selects which GPREG to enable the WE onto with [2:0] if [3] == 0;
    and if [3] == 1, [2:0] is decoded to select which SPREG to write to.

ROT[4:0]  [50:46]
    Steers the 32-bit barrel shifter

MSK[1]  [52]
    If [1] then masking is enabled; if [0] then pass-thru

MSK[0]  [51]
    If [1] selects MASK/ROTATE output, if [0] selects SPLIT_ADD output,
    on ALUA input mux.

ALUOP[5:4]  [45:44]
    [1x] selects ALUA input as ALU_OUT. The reason for this is to enable
    a MOV from PMEM[index] with mask and rot; but we lose ALUOP due to
    bit overlays, so we can't use the ALU in the same instruction.
    [00] selects ADDER output
    [01] selects LOGIC output

ALUOP[3:0]  [43:40]
    On LOGIC unit, these 4 bits are the mux inputs steered by the bit
    pairs.

ALUOP[1:0]  [41:40]
    Selects CY_IN to ADDER:
    [00] selects "0"
    [01] selects "1" (for subtracts)
    [1x] selects CC_REG_CY

ALUOP[2]  [42]
    If '1', inverts ADDER input on the A port

ALUOP[3]  [43]
    If '1', inverts ADDER input on the B port

IMMEDIATE  [31:0]
    32-bit immediate value used on ALUB input path; if
    (RDST == MEM_ADDR) then only bits [27:0] are used

MASK  [31:0]
    32-bit immediate value used in MASK and GANG_OP units for bit
    masking; AND'ed with the input value

PINDEX[10:2]  [44:36]
    Used to address words in PMEM for MOV operations and for loading
    PCNT for sequential pmem operations. a.k.a. INDEX[8:0]

MEM_SIZE[1:0]  [31:30]
    In LDST format, indicates the size to MEM_ADDR:
    [00]: 1 word
    [01]: 2 words (only aligned double-word allowed)
    [10]: 4 words (aligned on a 16-byte boundary)
    [11]: 8-word burst (aligned on an 8-word (32-byte) boundary)
    Note that hardware masks the lower address bits to force
    size-alignment

MEM_DIR  [29]
    In LDST format, [1] is a store, [0] is a load from memory

RES_BIT_SRC_A[9:0]  [41:32]
    Selects a bit of the 512-bit result vector; bit [9] is not
    connected, leaves room for future growth. Bits [8:5] select the
    word to port W0[31:0] on the file. Bits [4:0] select the bit within
    the word to port B0

RES_BIT_SRC_B[9:0]  [31:22]
    Same as above, but to word W1 and bit B1.

RES_BIT_SRC_C[9:0]  [21:12]
    Same as above, but to word W2 and bit B2.

RES_BIT_DST[9:0]  [56:47]
    [9] is reserved for future growth. [8:5] are decoded to a row
    select, and [4:0] are decoded to a column select for enabling the
    bit write.

RES0_SEL[1:0]  [46:45]
    Mux select for the DIN0 bit to RESVEC;
    [00]: CMPR_A
    [01]: RES_BIT0
    [10]: RES_GANG
    [11]: COND_CODE as selected by {FALSE, CC_SEL[4:0]}

RES1_SEL  [43]
    Mux select for the DIN1 bit to RESVEC, used if 2BIT is set;
    [0]: CMPR_B
    [1]: RES_BIT1

2BIT  [44]
    Enables next-neighbor write to odd-numbered bits in RESVEC, for
    operations with two results (dbitop, cmprn)

BITOP_AB[3:0]  [7:4]
    These bits are selected by {BIT1, BIT0} to provide arbitrary boolean
    functions on the bits: {00}->[0], {01}->[1], {10}->[2], {11}->[3]

GANG_OP[1]  [43]
    Mux steering. '1' = AND, '0' = OR

GANG_OP[0]  [42]
    Inverts result if '1' to create NAND or NOR

BRANCH[9:0]  [9:0]
    If BRANCH condition passes, this is the signed relative branch
    offset in CMEM

CALL  [31]
    Loads a copy of (PC+1) into the microstack; timed so that the
    address saved is one past the branch delay slot, and bumps
    microstack pointer

RET  [30]
    Forces the contents of the microstack register into the PC reg and
    decrements the microstack pointer

BRANCH_REG  [29]
    If '1', branch to REG_B output on a branch/call; if '0' branch to
    the immediate value

FALSE  [27]
    If '1', invert the output of the CC_MUX

CC_SEL[4:0]  [26:22]
    Selects a condition code bit for a branch decision

Special ops
    Defined in "SPECOP bit assignments" (Table 22)





2. Register Select Codes 

2.1 A-side Operands and Destination Registers 

TABLE 17

Register Select Codes for Destinations and for A-side Sources

REG[2:0]   REG[3] = 0,     REG[3] = 1, Dst.   REG[3] = 1, Src.
           Src. or Dst.
0b000      GPREG0 (g0)     NULL (discard)     CC_REG
0b001      GPREG1 (g1)     BASE_REG           BASE_REG
0b010      GPREG2 (g2)     DFIFO_W            DFIFO_R
0b011      GPREG3 (g3)     MEM_ADDR           BASE_REG_MSK
0b100      GPREG4 (g4)                        PMEM
0b101      GPREG5 (g5)     CEFADR
0b110      GPREG6 (g6)     CESTART
0b111      GPREG7 (g7)     CECNT
2.2 B-side Operands

TABLE 18

Register Select Codes for B-side Sources

REG[2:0]   REG[3] = 0      REG[3] = 1
0b000      GPREG0 (g0)     IMMEDIATE
0b001      GPREG1 (g1)     IMMED_ADDR[27:0] ([31:28] are 0x0)
0b010      GPREG2 (g2)     DURATION
0b011      GPREG3 (g3)     MEM_WAIT
0b100      GPREG4 (g4)     TIMER
0b101      GPREG5 (g5)     DIAG_REG
0b110      GPREG6 (g6)
0b111      GPREG7 (g7)     RESVEC [1]

[1] Indirect addressing of RESVEC: RESVEC accesses a word of the result
vector pointed to by WCNT (which was loaded via a specop) and then
autoincrements the index. After the RESVEC store to dfifo is completed a
resvec_index_unlock must be executed to enable random access to
RESVEC.



3. ALU and Logic Operations 
3.1 Adder Op Codes 

TABLE 19

ALUOP Bit Specifications for ADDER (ALUOP[4] = 0)

OPERATION                    ALUOP[3:0]   <ALUop> Name
A + B                        0b0000       ADD
A + B + CY                   0b0010       ADC
A + B + 1                    0b0001       ADINC
A - B                        0b1001       SUB
A - B - ~CY (A + ~B + CY)    0b1010       SUBB
A - B - 1                    0b1000       SBDEC
B - A                        0b0101       SBR (Reverse)
B - A - 1                    0b0100       SBRDEC
B - A - ~CY (~A + B + CY)    0b0110       SBRB

(~ denotes inversion.)
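The adder op codes follow from the ALUOP bit definitions in Table 16 (ALUOP[1:0] selects the carry-in, ALUOP[2] inverts the A port, ALUOP[3] inverts the B port). The sketch below models that decode for 32-bit operands; the function name `adder` and its return convention are hypothetical, and treating the carry-out as bit 32 of the sum is an assumption.

```python
MASK32 = 0xFFFFFFFF


def adder(a, b, aluop, cy=0):
    """Return (result, carry_out) for the 32-bit ADDER given ALUOP[3:0].

    cy is the saved carry (CC_REG_CY), used when ALUOP[1] is set.
    """
    a &= MASK32
    b &= MASK32
    if aluop & 0b0100:            # ALUOP[2]: invert ADDER input on the A port
        a = ~a & MASK32
    if aluop & 0b1000:            # ALUOP[3]: invert ADDER input on the B port
        b = ~b & MASK32
    select = aluop & 0b0011       # ALUOP[1:0]: carry-in select
    if select == 0b00:
        carry_in = 0
    elif select == 0b01:
        carry_in = 1              # used for subtracts
    else:
        carry_in = cy & 1         # [1x] selects the saved carry
    total = a + b + carry_in
    return total & MASK32, total >> 32


# SUB (0b1001) computes A + ~B + 1, which is A - B in two's complement.
difference, carry = adder(5, 3, 0b1001)
```

Each row of the op code table is then just a particular choice of the three controls, for example SBR (0b0101) inverts A and forces the carry-in to 1.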


3.2 Logic Op and BITOP Codes 




TABLE 20

ALUOP Bit Specifications for LOGIC (ALUOP[4] = 1)

OPERATION       ALUOP[3:0]   <ALUop> Name
AND             0b1000       AND
OR              0b1110       OR
XOR             0b0110       XOR
NAND            0b0111       NAND
NOR             0b0001       NOR
XNOR            0b1001       XNOR
INVERT_A        0b0011       INVA
INVERT_B        0b0101       INVB
PASS_A          0b1100       PASSA
PASS_B          0b1010       PASSB
ZERO            0b0000       ZERO
ONES            0b1111       ONES
A_AND_NOT_B     0b0100       AANDNB
B_AND_NOT_A     0b0010       BANDNA
B_OR_NOT_A      0b1011       BORNA
A_OR_NOT_B      0b1101       AORNB



BITOPs and 32-bit logic operations use the two operand
bits as selects into a MUX which selects among 4 bits
provided in the instruction. The encoding for logic
operations uses the value of each pair of operand bits {A,B} to
select which bit of ALUOP[3:0] provides the result. When
the logic operation is performed on bit operands from
RESVEC, the bits {bsrcb, bsrca} provide the same selection
of bits from the BITOP field (that is, for bitopab we use
{b1,b0} and for bitopac we use {b2,b0} as operands):

Operand {b1,b0} or {b2,b0} (or bits of {opA,opB})   {1,1}       {1,0}       {0,1}       {0,0}
BITOP (or ALUOP) bit selected as the result         BITOPAx[3]  BITOPAx[2]  BITOPAx[1]  BITOPAx[0]
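The MUX encoding above amounts to treating ALUOP[3:0] as a four-entry truth table indexed by the operand bit pair. A minimal sketch, assuming the {opA,opB} ordering shown in the table with opA as the more significant select; the function names are hypothetical:

```python
def logic_bit(aluop, a_bit, b_bit):
    """Select one bit of ALUOP[3:0] using the operand pair {a_bit, b_bit}."""
    index = (a_bit << 1) | b_bit      # {1,1}->[3], {1,0}->[2], {0,1}->[1], {0,0}->[0]
    return (aluop >> index) & 1


def logic32(aluop, a, b):
    """Apply the selected boolean function to every bit of two 32-bit words."""
    result = 0
    for i in range(32):
        bit = logic_bit(aluop, (a >> i) & 1, (b >> i) & 1)
        result |= bit << i
    return result


AND, XOR, A_OR_NOT_B = 0b1000, 0b0110, 0b1101   # codes from the LOGIC op code table
masked = logic32(AND, 0xF0F0, 0xFF00)
```

Because the op code is itself the truth table, any of the 16 two-input boolean functions is available without a separate decoder, which matches the 16 rows listed for the LOGIC unit.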






4. Condition Code Selects 

Each of these values can be tested true or inverted based 
on bit "F" in the instruction. 

TABLE 21

Condition Code MUX values

CC_SEL     Value          Notes
0b00000    TRUE           For unconditional branch
0b00001    CY             Last saved Carry (or a bypass of it if the
                          preceding instruction had CC_WE set)
0b00010    Z              Last saved Zero (or a bypass of it)
0b00011    N              Sign bit of last result (or a bypass of it)
0b00100    V              Signed overflow (CY N) of last result
                          (or a bypass of it)
0b00101    GT             CY && ~Z (unsigned Greater Than)
0b00110    LT             ~CY (unsigned Less Than)
0b00111    GE             CY || Z (unsigned Greater Than or Equal)
0b01000    LE             ~CY || Z (unsigned Less Than or Equal)
0b01001    SZ             STICKY_Z, set via a SPECOP. Each time CC_Z is
                          written, this bit will clear if CC_Z is '0',
                          otherwise it holds its previous value.
0b01010    RX_RING        RX Ring has at least one buffer for this CE
0b01011    RECLASS_RING   Reclassify Ring has at least one buffer for
                          this CE
0b01100    PEND_RD_WAIT   There is a read pending for which some data
                          has not yet arrived in DFIFO_R
0b01101    PEND_WR        DFIFO_W has at least one word in it
0b01110    PEND_ADDR      MEM_ADDR has at least one address in it
0b01111    RES_BIT        Selected bit of Result Vector (using bit2
                          (port 0))
0b10000    MSG_IN_A       These are the message bits from the PP or AP
0b10001    MSG_IN_B       to the microcode indicating that an action is
0b10010    MSG_IN_C       to be taken (CTRL fill, hash insert or delete,
0b10011    MSG_IN_D       etc). These are assigned by software
                          convention. Note that when a BRANCH_cc is made
                          on any of these bits the associated CCREG bit
                          will clear when the branch is taken.
0b10100    SGT            ~Z && ~N (Signed greater-than)
0b10101    SLT            ~Z && N (Signed less-than)
0b10110    SGE            Z || ~N (Signed greater-than-or-equal)
0b10111    SLE            Z || N (Signed less-than-or-equal)
0b11000    PEND_RD_DATA   At least one word is available in DFIFO_R
0b11001    MTCH_AORB      Any A- or B-side operand matched during a
                          cmprn instruction
0b11010    MTCH_A         Any A-side operand matched during a cmprn
                          instruction
0b11011    MTCH_B         Any B-side operand matched during a cmprn
                          instruction
0b11100    MTCH_AANDB     Any 64-bit A-B pair operand matched during a
                          cmprn instruction



5. Special Operation Fields 

These bits are enabled by SPECOP_EN. 



TABLE 22

SPECOP bit assignments

Bit       Name                  Description
[0]       unlock_pcnt           Puts PCNT counter back into normal
                                immediate-P-index mode
[1]       unlock_resvec_index   Puts RESVEC index counter back into
                                normal immediate mode
[2]       inc_rx_index          Increments CE_CONS pointer in this CE's
                                RX ring
[3]       inc_reclassify_index  Increments CE_CONS pointer in this CE's
                                RECLASS ring
[4]       clear_hit             Clears CCREG[MTCH_A, MTCH_B, MTCH_AORB,
                                MTCH_AANDB]
[5]       clear_duration        Sets the DURATION counter to 0x0
[6]       reset_gpreg           Flash clear of GPREG[7:0]
[7]       reset_resvec0         Flash clear of RESVEC[31:0]. Allows
                                preservation of up to 32 global bit
                                variables while clearing the rest
[8]       reset_resvec_15_1     Flash clear of RESVEC[511:32]
[9]       setsz                 Sets CC_REG[SZ] to '1' to start a
                                chained-equality compare
[10]      do_cmem_fill          Triggers a CMEM fill sequence
[11]      halt                  Sets CSR[HALT] and freezes the CE
                                pipeline
[15:12]   set_msg[3:0]          Each bit sets one of the 4 MSG_OUT bits
                                in CE_CSR
[24]      ld_ncnt               Loads N-counter for CMPRN instruction
[25]      ld_bdst_cnt           Loads BDST counter, sets RESVEC
                                sequential mode (for CMPRN & resvec
                                spills)
[26]      bdst_cnt_mode         '0' = count-by-2 for CMPRN, '1' =
                                count-by-32 for resvec spill
[27]      ld_pcnt               Writes either PINDEX[10:2] or REGB[10:2]
                                into PCNT and sets PCNT autoincrement
                                mode per PCNT_INC
[28]      pcnt_reg              With ld_pcnt, '0' = load with immediate,
                                '1' = load from gpreg on B-side
[29]      pcnt_inc              With ld_pcnt, '1' = pcnt auto-increments,
                                '0' = no increments
[30]      sleep                 Freezes pipeline, sets CECSR[SLEEP], puts
                                CMEM in power-down mode. Sleep mode
                                persists until any of CECSR[RX_RING,
                                RECLASS, MSG_IN[D:A]] causes a wake up.



6. Miscellany 
6.1 Memory Scheduling Rules

A memory access is scheduled by writing the address/
size/direction to the MEM_ADDR special register. The
following rules apply to scheduling of memory accesses;
violation of any of these rules will cause the pipeline to
HALT with status of the cause of the error in the CE Control
and Status Register (CECSR).

1) There must be at least one intervening instruction
between a LD and use of the resulting data if no other read
data is outstanding. A load followed by immediate
consumption when the outstanding schedule is '0' will result in a
deadlock.

2) A maximum of 16 slots of read data can be scheduled.
A slot is a 2-word entry in DFIFO_R. A LD or LD2
consumes 1 slot, a LD4 consumes 2 slots, and a LD8
consumes 4 slots in DFIFO_R. The appropriate number of
slots must be available before another {LD, LD2, LD4,
LD8} is scheduled.

3) A maximum of 32 outstanding words of read data can
be scheduled; data must be consumed to make room in
DFIFO_R before more can be scheduled.

4) Precisely the correct number of words of write data
must be written to DFIFO_W prior to scheduling the store
of that size.
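The slot accounting in rule 2 can be illustrated with a small bookkeeping sketch of the kind a microcode assembler or simulator might use. Everything here is hypothetical scaffolding: the class `ReadScheduler`, its methods, and the simplification that a load's slots are released all at once when its data is consumed are assumptions, not behavior stated in the text. Only the slot costs and the 16-slot limit come from the rules above.

```python
from collections import deque

# Slot cost per load size, from rule 2 (a slot is a 2-word entry in DFIFO_R).
SLOTS_PER_LOAD = {"LD": 1, "LD2": 1, "LD4": 2, "LD8": 4}
MAX_SLOTS = 16


class ReadScheduler:
    """Tracks DFIFO_R slots held by loads whose data has not been consumed."""

    def __init__(self):
        self.outstanding = deque()   # slot counts, oldest load first

    @property
    def free_slots(self):
        return MAX_SLOTS - sum(self.outstanding)

    def schedule(self, op):
        """Reserve slots for a load, or raise if rule 2 would be violated."""
        need = SLOTS_PER_LOAD[op]
        if need > self.free_slots:
            raise RuntimeError(f"{op} needs {need} slot(s), {self.free_slots} free")
        self.outstanding.append(need)

    def consume_oldest(self):
        """Model software reading all data of the oldest load (assumption)."""
        return self.outstanding.popleft()


sched = ReadScheduler()
for _ in range(4):
    sched.schedule("LD8")            # four 8-word bursts fill all 16 slots
```

Sixteen 2-word slots hold 32 words, so under this model the slot limit in rule 2 and the word limit in rule 3 are reached together when every slot is fully used.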

6.2 Register Write-Use Rules 
GPREG and RESVEC results can safely be accessed in
the instruction after the data is written to them.

PCNT, WCNT, and NCNT are all loaded via use of a
specop. They can safely be used immediately in the next
instruction.

The specop unlock_pcnt takes effect immediately, so
PMEM immediate index can safely be used in the next
instruction. Likewise, specop unlock_resvec_index takes
effect immediately, and random access to RESVEC can be
used in the next instruction.

BASE_REG has a one-cycle write-use delay rule; if it is
written to in instruction A, it cannot be used as a source
operand in instruction A+1.

PMEM has a one-cycle write-use delay rule for any
particular address. If address addr is written to in instruction
A, then addr may not be read in instruction A+1; however it
is perfectly safe to read any other location in PMEM in cycle
A+1.

Data written to special register NULL may not be read
back because, well, it's gone, man.
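The two one-cycle delay rules lend themselves to a simple static check over a microcode listing. The sketch below is an illustration only: the function `find_hazards`, the representation of an instruction as a pair of (writes, reads) sets, and the use of `("PMEM", addr)` tuples for packet memory locations are all hypothetical conventions chosen for the example.

```python
def find_hazards(program):
    """Return (instruction index, resource) pairs that break a delay rule.

    program is a list of (writes, reads) pairs, one per instruction, where
    each element is a set of resource names. A PMEM location is written as
    the tuple ("PMEM", addr). Only BASE_REG and PMEM have a one-cycle
    write-use delay; GPREG and RESVEC results are forwarded.
    """
    hazards = []
    for i in range(1, len(program)):
        previous_writes, _ = program[i - 1]
        _, reads = program[i]
        for resource in previous_writes & reads:
            is_pmem = isinstance(resource, tuple) and resource[0] == "PMEM"
            if resource == "BASE_REG" or is_pmem:
                hazards.append((i, resource))
    return hazards


listing = [
    ({"BASE_REG"}, set()),          # A:   write BASE_REG
    (set(), {"BASE_REG"}),          # A+1: reads it too early
    ({("PMEM", 4)}, set()),         # write PMEM word 4
    (set(), {("PMEM", 8)}),         # a different address is safe
]
problems = find_hazards(listing)
```

A tool built this way would typically respond to a reported hazard by inserting an unrelated instruction or a no-op between the write and the read.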

6.3 PMEM Addressing 

Packet Memory PMEM can be addressed by an immediate
index provided in the microword, indirectly from the
PCNT register, or indirectly with auto-increment of PCNT.
Immediate indexing is the standard mode; use of PCNT is
initiated with the ld_pcnt special operation, which also
carries the mode bit pcnt_inc that can optionally be
asserted. This special operation sets the state bits
USE_PCNT and (optionally) PCNT_INC_MODE. USE_PCNT
is cleared by the special operation unlock_pcnt.

PCNT can be loaded from an immediate value PINDEX
provided in the ld_pcnt special operation, or from bits
[10:2] of any GPREG specified in RSRCB if the specop bit
pcnt_reg is set during the ld_pcnt.
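The three addressing modes can be summarized as a small state machine. This is a sketch under stated assumptions rather than a description of the hardware: the class `PmemIndex` and its method names are hypothetical, the index is modeled as the 9-bit word index PINDEX[10:2], and wrap-around of the counter on increment is an assumption the text does not confirm.

```python
INDEX_MASK = 0x1FF   # 9-bit word index, corresponding to PINDEX[10:2]


class PmemIndex:
    """Models the USE_PCNT and PCNT_INC_MODE state bits."""

    def __init__(self):
        self.use_pcnt = False
        self.pcnt_inc_mode = False
        self.pcnt = 0

    def ld_pcnt(self, pindex=0, pcnt_reg=False, gpreg_b=0, pcnt_inc=False):
        """ld_pcnt specop: load PCNT from PINDEX or from GPREG bits [10:2]."""
        if pcnt_reg:
            self.pcnt = (gpreg_b >> 2) & INDEX_MASK
        else:
            self.pcnt = pindex & INDEX_MASK
        self.use_pcnt = True
        self.pcnt_inc_mode = pcnt_inc

    def unlock_pcnt(self):
        """unlock_pcnt specop: return to immediate indexing."""
        self.use_pcnt = False

    def next_index(self, immediate_index):
        """Word index used by the next PMEM access."""
        if not self.use_pcnt:
            return immediate_index & INDEX_MASK
        index = self.pcnt
        if self.pcnt_inc_mode:
            self.pcnt = (self.pcnt + 1) & INDEX_MASK   # wrap is an assumption
        return index


pmem = PmemIndex()
pmem.ld_pcnt(pindex=3, pcnt_inc=True)    # sequential access starting at word 3
first, second = pmem.next_index(0), pmem.next_index(0)
```

While USE_PCNT is set, the immediate index in the microword is ignored by this model, which is why the example passes a placeholder value of 0.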

6.4 Microstack 

The microstack is written and the stack pointer is
incremented every time a conditional CALL instruction succeeds.
It is read and the stack pointer is decremented every time a
conditional RET instruction succeeds. The address written is
the address of the instruction following the delay slot of the
call, since the delay slot is always executed. The microstack
holds up to 8 entries. Calling to a depth greater than 8, or
returning past the valid number of entries, causes a halt with
a report of STACK_ERROR in the CECSR.
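The microstack behavior can be modeled in a few lines. The sketch is hypothetical: the class and exception names are invented for illustration, and computing the return address as the call address plus 2 (the call itself, then its delay slot) is one reading of "the instruction following the delay slot", assuming one address per instruction.

```python
class StackError(Exception):
    """Stands in for the STACK_ERROR halt reported in the CECSR."""


class Microstack:
    DEPTH = 8   # the microstack holds up to 8 entries

    def __init__(self):
        self.entries = []

    def call(self, call_address):
        """A successful CALL: save the address after the delay slot."""
        if len(self.entries) >= self.DEPTH:
            raise StackError("call depth greater than 8")
        # call_address + 1 is the delay slot, which always executes.
        self.entries.append(call_address + 2)

    def ret(self):
        """A successful RET: pop the saved address into the PC."""
        if not self.entries:
            raise StackError("return past the valid number of entries")
        return self.entries.pop()


stack = Microstack()
stack.call(100)
return_pc = stack.ret()
```

A CALL or RET whose condition fails would simply not invoke these methods, since the text ties the stack update to the instruction succeeding.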

VI. Programming Model 

This section describes the programming model and set of
abstractions employed when creating an application for the
NetBoost platform (i.e., the platform described in this patent
application). An application on the NetBoost platform is to
be considered a service, provided within the network, that
may require direct knowledge or manipulation of network
packets or frames. The programming model provides for
direct access to low-level frame data, plus a set of library
functions capable of reassembling low-level frame data into
higher-layer messages or packets. In addition, the library
contains functions capable of performing protocol
operations on network or transport-layer messages.

An application developed for the NetBoost platform
receives link-layer frames from an attached network
interface, matches the frames against some set of selection
criteria, and determines their disposition. Frame processing
takes place as a sequence of serialized processing steps.
Each step includes a classification and action phase. During
the classification phase, frame data is compared against
application-specified matching criteria called rules. When a
rule's matching criteria evaluates true, its action portion
specifies the disposition of the frame. Execution of the
action portion constitutes the action phase. Only the actions
of rules with true matching criteria are executed.

Implementing an application for the NetBoost platform
involves partitioning the application into two modules.
A module is a grouping of application code destined to
execute in a particular portion of the NetBoost platform.
There are two modules required: the application processor
(AP) module, and the policy engine (PE) module.
Application code in the AP module runs on the host processor and
is most appropriate for processing not requiring wire-speed
access to network frames. Application code for the PE
module comprises the set of classification rules written in
the NetBoost Classification Language (NCL), and an
accompanying set of compiled actions (C or C++ functions/
objects). PE actions are able to manipulate network frames
with minimal overhead, and are thus the appropriate mechanism for
implementing fast and simple manipulation of frame data.
The execution environment for PE action code is more
restricted than that of AP code (no virtual memory or
threads), but includes a library providing efficient
implementation for common frame manipulation tasks (see
Section VIII). A message passing facility allows for
communication between PE action code and the AP module.
1. Application Structure 

FIG. 15 illustrates the NetBoost application structure. 

Applications 1402 written for the NetBoost platform must
be partitioned into the following modules and sub-modules,
as illustrated in FIG. 15.

AP Module (application processor (host) module) 1406

PE Module (policy engine module) 1408

Classification rules: specified in NCL

Action implementation: object code provided by app
developer

The AP module 1406 executes in the programming envi- 
ronment of a standard operating system and has access to all 
PEs 1408 available on the system, plus the conventional 
APIs implemented in the host operating system. Thus, the 
AP module 1406 has the capability of performing either
frame-level processing (in conjunction with the PE) or
traditional network processing using a standard API.

The PE module 1408 is subdivided into a set of
classification rules and actions. Classification rules are specified in
the NetBoost Classification Language (NCL) and are com- 
piled on-the-fly by a fast incremental compiler provided by 
NetBoost. Actions are implemented as relocatable object 
code provided by the application developer. A dynamic 
linker/loader included with the NetBoost platform is capable 
of linking and loading the classification rules with the action 
implementations and loading these either into the host 
(software implementation) or hardware PE (hardware 
implementation) for execution. 

The specific division of functionality between AP and PE 
modules 1406 and 1408 in an application is left entirely up 
to the application designer. Preferably, the AP module 1406 
should be used to implement initialization and control, user 
interaction, exception handling, and infrequent processing 
of frames requiring special attention. The PE module 1408 
preferably should implement simple processing on frames 
(possibly including the reconstruction of higher-layer 
messages) requiring extremely fast execution. PE action 
code runs in a run-to-completion real-time environment 
without memory protection, similar to an interrupt handler 
in most conventional operating systems. Thus, functions 
requiring lengthy processing times should be avoided, or 
executed in the AP module 1406. In addition, other functions 
may be loaded into the PE to support actions, asynchronous 
execution, timing, or other processing (such as upcalls/ 
downcalls, below). All code loaded into the PE has access to 
the PE runtime environment, provided by the ASL. 

The upcall/downcall facility provides for communication 
between PE actions and AP functions. An application may 
use upcalls/downcalls for sharing information or signaling 
between the two modules. The programmer may use the 
facility to pass memory blocks, frame contents, or other 
messages constructed by applications in a manner similar to
asynchronous remote procedure calls.

2. Basic Building Blocks 

This section describes the C++ classes needed to develop 
an application for the NetBoost platform. Two fundamental
classes are used to abstract the classification and handling of
network frames:

ACE, representing classification and action steps

Target, representing possible frame destinations
2.1 ACEs

The ACE class (short for Action-Classification-Engine)
abstracts a set of frame classification criteria and associated
actions, upcall/downcall entrypoints, and targets. ACEs are
simplex: frame processing is uni-directional. An application
may make use of cascaded ACEs to achieve serialization of
frame processing. ACEs are local to an application.

ACEs provide an abstraction of the execution of classi- 
fication rules, plus a container for holding the rules and 
actions. ACEs are instantiated on particular hardware 
resources either by direct control of the application or by the
plumber application. 

An ACE 1500 is illustrated in FIG. 16. 

The ACE is the abstraction of frame classification rules 
1506 and associated actions 1508, destinations for processed 
frames, and downcall/upcall entrypoints. An application
may employ several ACEs, which are executed in a serial 
fashion, possibly on different hardware processors. 

FIG. 16 illustrates an ACE with two targets 1502 and 
1504. The targets represent possible destinations for frames 
and are described in the following section.

Frames arrive at an ACE from either a network interface
or from an ACE. The ACE classifies the frame according to its
rules. A rule is a combination of a predicate and action. A
rule is said to be "true" or to "evaluate true" or to be a
"matching rule" if its predicate portion evaluates true in the
Boolean sense for the current frame being processed. The
action portion of each matching rule indicates what
processing should take place.

The application programmer specifies rule predicates
within an ACE using Boolean operators, packet header
fields, constants, set membership queries, and other
operations defined in the NetBoost Classification Language
(NCL), a declarative language described in Section VII. A
set of rules (an NCL program) may be loaded or unloaded
from an ACE dynamically under application control. In
certain embodiments, the application developer implements
actions in a conventional high level language. Special
external declaration statements in NCL indicate the names of
actions supplied by the application developer to be called as
the action portion for matching rules.

Actions are function entry-points implemented according
to the calling conventions of the C programming language
(static member functions in C++ classes are also supported).
The execution environment for actions includes a C and C++
runtime environment with restricted standard libraries
appropriate to the PE execution environment. In addition to
the C environment, the ASL library provides added
functionality for developing network applications. The ASL
provides support for handling many TCP/IP functions such
as IP fragmentation and reassembly, Network Address
Translation (NAT), and TCP connection monitoring
(including stream reconstruction). The ASL also provides
support for encryption and basic system services (e.g.
timers, memory management).

During classification, rules are evaluated first-to-last.
When a matching rule is encountered, its action executes and
returns a value indicating whether it disposed of the frame.
Disposing of a frame corresponds to taking the final desired
action on the frame for a single classification step (e.g.
dropping it, queueing it, or delivering it to a target). If an
action executes but does not dispose of the current frame, it
returns a code indicating the frame should undergo further
rule evaluations in the current classification step. If any
action disposes of the frame, the classification phase
terminates. If all rules are evaluated without a disposing action,
the frame is delivered to the default target of the ACE.
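The evaluation order described above can be sketched as a short loop. This is an illustration only, written in Python rather than NCL or the C/C++ action interface, and every name in it (`classify`, `DISPOSED`, `CONTINUE`, `DEFAULTED`, the rule tuples) is hypothetical rather than part of the NetBoost API.

```python
DISPOSED = "disposed"     # action took the final action on the frame
CONTINUE = "continue"     # frame should undergo further rule evaluation
DEFAULTED = "default"     # no action disposed of the frame


def classify(frame, rules, default_target):
    """Run one classification step over an ordered list of rules.

    rules is a list of (predicate, action) pairs. Only the actions of
    rules whose predicate evaluates true are executed, first-to-last.
    """
    for predicate, action in rules:
        if predicate(frame) and action(frame) == DISPOSED:
            return DISPOSED               # classification phase terminates
    default_target(frame)                 # no rule disposed of the frame
    return DEFAULTED


def count_frame(frame):
    counters["seen"] += 1
    return CONTINUE


def drop_frame(frame):
    counters["dropped"] += 1
    return DISPOSED


counters = {"seen": 0, "dropped": 0}
delivered = []
rules = [
    (lambda frame: True, count_frame),                    # matches everything
    (lambda frame: frame.get("port") == 23, drop_frame),  # example criterion
]
outcome = classify({"port": 23}, rules, delivered.append)
```

In this example the first rule matches and runs but does not dispose of the frame, so evaluation continues to the second rule, which drops it; a frame with any other port would fall through to the default target.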
2.2 Targets 

Targets specify possible destinations for frames (an ACE
or network interface). A target is said to be bound when it is
attached to either an ACE or a network interface (in the
outgoing direction); otherwise it is unbound. Frames delivered to unbound
gets are dropped. Target bindings are manipulated by a 
plumbing application in accordance with the present inven- 
tion. 
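The bound and unbound states of a target can be pictured with a minimal sketch. The class below is hypothetical and is not the NetBoost Target class: a destination is modeled simply as a callable that accepts a frame, standing in for either an ACE or an outgoing network interface.

```python
class Target:
    """A possible frame destination that is either bound or unbound."""

    def __init__(self, name):
        self.name = name
        self.bound_to = None          # unbound until a binding is made

    def bind(self, destination):
        """Bind to a destination; in practice done by a plumbing application."""
        self.bound_to = destination

    def unbind(self):
        self.bound_to = None

    def deliver(self, frame):
        """Deliver a frame; return False if it was dropped."""
        if self.bound_to is None:
            return False              # frames delivered to unbound targets are dropped
        self.bound_to(frame)
        return True


received = []
target = Target("default")
dropped_first = not target.deliver(b"frame-1")   # unbound: dropped
target.bind(received.append)                     # stand-in for a downstream ACE
target.deliver(b"frame-2")
```

Because the binding is held by the target rather than by the ACE, the same ACE code can be rewired to a different downstream ACE or interface without being modified.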

FIG. 17 shows a cascade of ACEs. ACEs use targets as 
frame destinations. Targets 1 and 2 (illustrated at 1602 and 
1604) are bound to ACEs 1 and 2 (illustrated at 1610 and 
1612), respectively. Target 3 (at 1606) is bound to a network 
interface (1620) in the outgoing direction. Processing occurs 
serially from left to right. Ovals indicate ACEs, hexagons 
indicate network interfaces. Outgoing arcs indicate bound 
targets. An ACE with multiple outgoing arcs indicates an 
ACE that performs a demultiplexing function: the set of 
outgoing arcs represent the set of all frame destinations in
the ACE, across all actions. In this example, each ACE has 
a single destination (the default target). When several hard- 
ware resources are available for executing ACEs (e.g. in the 
case of the NetBoost hardware platform), ACEs may execute 
more efficiently (using pipelining). Note, however, that 
when one ACE has finished processing a frame, it is given 
to another ACE that may execute on the same hardware 
resource. 

3. Complex Configurations 

As described above, a single application may employ 
more than one ACE. Generally, processing bidirectional 
network data will require a minimum of two ACEs. Four 
ACEs may be a common configuration for a system pro- 
viding two network interfaces and an application wishing to 
install ACEs at the input and output for each interface (e.g. 
in the NetBoost hardware environment with one PE). 

FIG. 18 illustrates an application employing six ACEs 
1802, 1804, 1806, 1808, 1810 and 1812. Shaded circles 
represent targets. Two directions of processing are depicted, 
as well as an ACE with more than one output arc and an ACE 
with more than one input arc. The arcs represent possible 
destinations for frames. 

An ACE depicted with more than one outgoing arc may 
represent the processing of a single frame, or in certain 
circumstances, the replication (copying) of a frame to be 
sent to more than one downstream ACE simultaneously. 
Frame replication is used in implementing broadcast and 
multicast forwarding (e.g. in layer 2 bridging and IP mul- 
ticast forwarding). The interconnection of targets to down- 
stream objects is typically performed by the plumber appli- 
cation described in the next section. 

4. Software Architecture 

This section describes the major components comprising 
the NetBoost software implementation. The software archi- 
tecture provides for the execution of several applications 
performing frame-layer processing of network data, and 
includes user-level, kernel-level, and embedded processor- 
level components (for the hardware platform). The software 
architecture is illustrated in FIG. 19.

The layers of software comprising the overall architecture
are described bottom-up. The first layer is the NetBoost
Policy Engine 2000 (PE). Each host system may be
equipped with one or more PEs. In systems equipped with
NetBoost hardware PEs, each PE will be equipped with 
several frame classifiers and a processor responsible for 
executing action code. For systems lacking the hardware PE, 
all PE functionality is implemented in software. The PE 
includes a set of C++ library functions comprising the 
Action Services Library (ASL) which may be used by action 
code in ACE rules to perform messaging, timer-driven event 
dispatch, network packet reassembly or other processing. 

The PE interacts with the host system via a device driver 
2010 and ASL 2012 supplied by NetBoost. The device driver 
is responsible for supporting maintenance operations to 
NetBoost PE cards. In addition, this driver is responsible for 
making the network interfaces supplied on NetBoost PE 
cards available to the host system as standard network 
interfaces. Also, specialized kernel code is inserted into the 
host's protocol stack to intercept frames prior to receipt by 
the host protocol stack (incoming) or transmission by con- 
ventional network interface cards (outgoing). 

The Resolver 2008 is a user-level process started at boot 
time responsible for managing the status of all applications 
using the NetBoost facilities. In addition, it includes the 
NCL compiler and PE linker/loader. The process responds to 
requests from applications to set up ACEs, bind targets, and 
perform other maintenance operations on the NetBoost 
hardware or software-emulated PE. 

The Application Library 2002 (having application 1, 2 &
3 at 2020, 2040, 2041) is a set of C++ classes providing the 
API to the NetBoost system. It allows for the creation and 
configuration of ACEs, binding of targets, passing of mes- 
sages to/from the PE, and the maintenance of the name-to- 
object bindings for objects which exist in both the AP and PE 
modules. 

The plumber 2014 is a management application used to 
set up or modify the bindings of every ACE in the system 
(across all applications). It provides a network administrator 
the ability to specify the serial order of frame processing by 
binding ACE targets to subsequent ACEs. The plumber is 
built using a client/server architecture, allowing for both 
local and remote access to specify configuration control. All 
remote access is authenticated and encrypted. 

VII. Classification Language 

The NetBoost Classification Language (NCL) is a 
declarative high level language for defining packet filters. 
The language has six primary constructs: protocol 
definitions, predicates, sets, set searches, rules and external 
actions. Protocol definitions are organized in an object- 
oriented fashion and describe the position of protocol header 
fields in packets. Predicates are Boolean functions on pro- 
tocol header fields and other predicates. Rules consist of a 
predicate/action pair having a predicate portion and an 
action portion where an action is invoked if its correspond- 
ing predicate is true. Actions refer to procedure entrypoints 
implemented external to the language. 

Individual packets are classified according to the predi- 
cate portions of the NCL rules. More than one rule may be 
true for any single packet classification. The action portions
of rules with true predicates are invoked in the order the
rules have been specified. Any of these actions invoked may 
indicate that no further actions are to be invoked. NCL 
provides a number of operators to access packet fields and 
execute comparisons of those fields. In addition, it provides 
a set abstraction, which can be used to determine contain- 
ment relationships between packets and groups of defined 
objects (e.g. determining if a particular packet belongs to
some TCP/IP flow or set of flows), providing the ability to
keep persistent state in the classification process between
packets.

Standard arithmetic, logical and bit-wise operators are
supported and follow their equivalents in the C programming
language. These operators provide operations on the
fields of the protocol headers and result in scalar or Boolean
values. An include directive allows for splitting NCL programs
into several files.

1. Names and Data Types

The following definitions in NCL have names: constants,
protocols, predicates, fields, sets, searches on sets, and rules
(defined in later sections). A name is formed using
any combination of alphanumeric characters and underscores,
except that the first character must be an alphabetic
character. Names are case sensitive. For example,
set_tcp_udp
IsIP
isIPv6
set_udp_ports

The above examples are all legal names. The following 
examples are all illegal names: 
6_byte_ip 
set_tcp+udp 
ip_src&dst

The first is illegal because it starts with a numeric character;
the other two are illegal because they contain operators.
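
The naming rule above can be checked mechanically. The following Python sketch is illustrative only: the helper name is_legal_name and the restriction to ASCII letters are assumptions made here and are not part of NCL.

```python
import re

# A letter first, then any mix of letters, digits and underscores.
# Restricting "alphabetic" to ASCII letters is an assumption of this sketch.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_legal_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule described above."""
    return NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["set_tcp_udp", "IsIP", "isIPv6",
                      "6_byte_ip", "set_tcp+udp", "ip_src&dst"]:
        print(candidate, is_legal_name(candidate))
```

Because names are case sensitive, IsIP and isIP would be two distinct names; the check above accepts both.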

Protocol fields (see Section 6) are declared in byte-oriented
units and are used in constructing protocol definitions.
All values are big-endian. Fields specify the location
and size of portions of a packet header. All offsets are
relative to a particular protocol. In this way it is possible to
specify a particular header field without knowing the absolute
offset of any particular protocol header. Mask and
shift operations support the accessing of non-byte-sized
header fields. For example,
dst {ip[16:4]}
ver {(ip[0:1] & 0xf0) >> 4}
In the first line, the 4-byte field dst is specified as being at
byte offset 16 from the beginning of the IP protocol header.
In the second example, the field ver is a half-byte sized field
at the beginning of the IP header.
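
The two field definitions above can be read as byte-range accesses relative to the start of a protocol header. The Python sketch below shows one way such an access might be evaluated. The function read_field, the 14-byte link-layer offset and the sample header bytes are assumptions chosen for illustration; they are not the NetBoost implementation.

```python
def read_field(packet: bytes, proto_start: int, offset: int, size: int) -> int:
    """Read proto[offset:size] as an unsigned big-endian integer.

    proto_start is the absolute position of the protocol header in the
    packet; offset is relative to that header, as in NCL.
    """
    begin = proto_start + offset
    end = begin + size
    if end > len(packet):
        raise ValueError("field extends past the end of the packet")
    return int.from_bytes(packet[begin:end], "big")


# Illustrative packet: 14 zero bytes standing in for a link-layer header,
# followed by a 20-byte IPv4 header (10.0.0.1 -> 10.0.0.2, protocol 6).
ETH_LEN = 14
ip_header = bytes([0x45, 0x00, 0x00, 0x14, 0, 0, 0, 0, 64, 6, 0, 0,
                   10, 0, 0, 1, 10, 0, 0, 2])
packet = bytes(ETH_LEN) + ip_header

dst = read_field(packet, ETH_LEN, 16, 4)                # dst {ip[16:4]}
ver = (read_field(packet, ETH_LEN, 0, 1) & 0xF0) >> 4   # ver {(ip[0:1] & 0xf0) >> 4}
```

Only proto_start changes when the IP header sits at a different absolute position, which is the point of protocol-relative offsets.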
2. Operators 

Arithmetic, logical and bit-wise binary operators are
supported. Table 23 lists the arithmetic operators and grouping
operator supported:

TABLE 23

Arithmetic operators

Operator    Description

( )         Grouping operator
+           Addition
-           Subtraction
<<          Logical left shift
>>          Logical right shift



The arithmetic operators result in scalar quantities, which
are typically used for comparison. These operators may be
used in field and predicate definitions. The shift operations
do not support arithmetic shifts. The shift amount is a
compile time constant. Multiplication, division and modulo
operators are not supported. The addition and subtraction
operations are not supported for fields greater than 4 bytes.

Logical operators are supported that result in Boolean
values. Table 24 provides the logical operators that are
supported by the language.



TABLE 24

Logical operators

Operator    Description

&&          Logical AND
||          Logical OR
!           Not
>           Greater Than
>=          Greater Than or Equal To
<           Less Than
<=          Less Than or Equal To
==          Equal To
!=          Not Equal


Bit-wise operators are provided for masking and setting of
bits. The operators supported are as follows:

TABLE 25

Bit-wise operators

Operator    Description

&           Bit-wise AND
|           Bit-wise OR
^           Bit-wise Exclusive OR
~           Bit-wise One's Complement


The precedence and the associativity of all the operators
listed above are shown in Table 26. The precedence is listed
in decreasing order.

TABLE 26

Operator precedence

Precedence  Operators     Associativity

High        ( ) [ ]       Left to right
            ! ~           Right to left
            + -           Left to right
            << >>         Left to right
            < <= > >=     Left to right
            == !=         Left to right
            &             Left to right
            ^             Left to right
            |             Left to right
            &&            Left to right
Low         ||            Left to right



3. Field Formats 

The language supports several standard formats, and also 
domain specific formats, for constants, including the dotted- 
quad form for IP version 4 addresses and colon-separated 
hexadecimal for Ethernet and IP version 6 addresses, in 
addition to conventional decimal and hexadecimal con- 
stants. Standard hexadecimal constants are defined as they 
are in the C language, with a leading 0x prefix.
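
As an illustration of the constant formats just described, the following Python sketch converts each form to an unsigned integer in big-endian order. The function parse_constant is hypothetical. It treats colon-separated constants as byte-wise groups (the Ethernet form) and makes no attempt to handle IP version 6 notation, which is a simplification made for this sketch.

```python
def parse_constant(text: str) -> int:
    """Convert an NCL-style constant to an unsigned integer (sketch)."""
    text = text.strip()
    if text.lower().startswith("0x"):
        return int(text, 16)                    # C-style hexadecimal
    if ":" in text:
        groups, base = text.split(":"), 16      # colon-separated hex bytes
    elif "." in text:
        groups, base = text.split("."), 10      # dotted-quad IPv4
        if len(groups) != 4:
            raise ValueError("dotted-quad form needs exactly four parts")
    else:
        return int(text, 10)                    # conventional decimal
    value = 0
    for group in groups:
        byte = int(group, base)
        if not 0 <= byte <= 0xFF:
            raise ValueError(f"component out of byte range: {group!r}")
        value = (value << 8) | byte             # most significant byte first
    return value


if __name__ == "__main__":
    for sample in ["0x11223344", "101.230.135.45", "ff:12:34:56:78:9a", "23"]:
        print(sample, hex(parse_constant(sample)))
```

Representing every constant as a plain integer keeps later comparisons against extracted header fields uniform.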

For data smaller than 4 bytes in length, unsigned exten- 
sion to 4 bytes is performed automatically. A few examples 
are as shown below: 

TABLE 27 



Constant formats 

0x11223344 Hexadecimal form 
101.230.135.45 Dot separated IP address form 
ff:12:34:56:78:9a Colon separated MAC address form

4. Comments

C and C++ style comments are supported. One syntax
supports multiple lines, the other supports comments terminating
with a newline. The syntax for the first form follows
the C language comment syntax using /* and */ to demark
the start and end of a comment, respectively. The syntax for
the second form follows the C++ comment syntax, using //
to indicate the start of the comment. Such comments end at
the end of the line. Nesting of comments is not allowed in
the case of the first form. In the second case, everything is
discarded to the end of the line, so nesting of the second
form is allowed. Comments can occur anywhere in the
program. A few examples of comments are shown below.

Diagram 1: Legal comments

/* Comment in a single line */
// Second form of the comment: compiler ignores to end-of-line
/* Comments across multiple line
   second line
   third line */
// Legal comment // still ignored to end-of-line
/* First form // Second form, but OK
*/

The examples above are all legal. The examples shown in
Diagram 2 (below) are illegal.

Diagram 2: Illegal comments

/ * space */
/ new-line
* Testing */
/* Nesting /* Second level */
/ / space
/ new-line
/
// /* Nesting
*/

The first comment is illegal because of the space between /
and *, and the second one because of the new-line. The third
is illegal because of nesting. The fourth is illegal because of
the space between the / chars, and the next one because of
the new-line. The last one is illegal because the /* is ignored
(it lies inside a comment of the second form), leaving the */
on the following line without a matching /*.
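
The comment rules above can be sketched as a small scanner. The Python function below is an illustrative assumption, not the NCL compiler's lexer: the name strip_comments is hypothetical, and the sketch ignores quoted strings (such as the filename in an include directive), which a real lexer would have to protect.

```python
def strip_comments(source: str) -> str:
    """Remove /* ... */ and // comments; reject stray or unterminated ones."""
    out = []
    i, n = 0, len(source)
    while i < n:
        two = source[i:i + 2]
        if two == "//":
            # Second form: discard everything up to (not including) the newline.
            end = source.find("\n", i)
            i = n if end == -1 else end
        elif two == "/*":
            # First form: ends at the first */, so nesting is not recognized.
            end = source.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated /* comment")
            # Keep the newlines so that line numbering is preserved.
            out.append("\n" * source.count("\n", i, end))
            i = end + 2
        elif two == "*/":
            raise ValueError("*/ without a matching /*")
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

With this scanner, the last illegal example is rejected for the reason given in the text: the // form swallows the /*, and the */ on the next line is then unmatched. A nested first-form comment fails the same way once the inner */ closes the outer comment.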

5. Constant Definitions and Include Directives 

The language provides user-definable symbolic constants.
The syntax for the definition is the keyword #define, then the
name followed by the constant. No spaces are allowed
between # and define. The constant can be in any of the
forms described in the next subsection of this patent application.
The definition can start at the beginning of a line or
any other location on a line as long as the preceding
characters are either spaces or tabs. For example,

Diagram 3: Sample of constant definition usage

#define TELNET_PORT_NUM 23 // Port number for telnet
#define IP_ADDR 10.4.7.18
#define MAC_ADDR cd:ee:f0:34:74:93

The language provides the ability to include files within the
compilation unit so that pre-existing code can be reused. The
keyword #include is used, followed by the filename
enclosed in double quotes. The # must start on a new line,
but may have spaces immediately preceding the keyword.
No spaces are allowed between # and include. The
filename is any legal filename supported by the host. For
example,

Diagram 4: Sample include directives

#include "myproto.def" // Could be protocol definitions
#include "stdrules.rul" // Some standard rules
#include "newproto.def" /* New protocol definitions */

6. Protocol Definitions 

NCL provides a convenient method for describing the 
relationship between multiple protocols and the header fields 
they contain. A protocol defines fields within a protocol 
header, intrinsics (built-in functions helpful in processing 
headers and fields), predicates (Boolean functions on fields 
and other predicates), and the demultiplexing method to 
high-layer protocols. The keyword protocol identifies a 
protocol definition and its name. The name may later be 
referenced as a Boolean value which evaluates true if the 
protocol is activated (see 6.2). The declarations for fields, 
intrinsics and demultiplexing are contained in a protocol 
definition as illustrated below. 

6.1 Fields 

Fields within the protocol are declared by specifying a 
field name followed by the offset and field length in bytes. 
Offsets are always defined relative to a protocol. The base 
offset is specified by the protocol name, followed by colon 
separated offset and size enclosed in square brackets. This 
syntax is as shown below: 



field_name { protocol_name[offset:size] }

Fields may be defined using a combination of byte ranges
within the protocol header and shift/mask or grouping
operations. The field definitions act as access methods to the
areas within the protocol header or payload. For example,
fields within a protocol named MyProto might be specified
as follows:

dest_addr { MyProto[6:4] }
bit_flags { (MyProto[10:2] & 0x0ff0) >> 8 }

In the first example, field dest_addr is declared as a field at
offset 6 bytes from the start of the protocol MyProto and 4
bytes in size. In the second example, the field bit_flags is a
bit field; because it crosses a byte boundary, two bytes are
used in conjunction with a mask and right shift operation to
get the field value.

6.2 Intrinsics

Intrinsics are functions listed in a protocol statement, but
implemented internally. Compiler-provided intrinsics are
declared in the protocol definition (for consistency) using
the keyword intrinsic followed by the intrinsic name. Intrinsics
provide convenient or highly optimized functions that
are not easily expressed using the standard language constructs.
One such intrinsic is the IP checksum. Intrinsics
may be declared within the scope of a protocol definition or
outside, as in the following examples:

Diagram 5: Sample intrinsic declarations

protocol foo {
    field defs
    intrinsic chksumvalid { }
}
intrinsic now

The first example indicates that the chksumvalid intrinsic is
associated with the protocol foo. Thus, the expression
foo.chksumvalid could be used in the creation of predicates or
expressions defined later. The second example indicates a
global intrinsic called now that may be used anywhere
within the program. Intrinsics can return Boolean and scalar
values.

In a protocol definition, predicates are used to define
frequently used Boolean results from the fields within the
protocol being defined. They are identified by the keyword
predicate. Predicates are described in Section 7.

6.3 Demux

The keyword demux in each protocol statement indicates
how demultiplexing should be performed to higher-layer
protocols. In effect, it indicates which subsequent protocol is
"activated", as a function of fields and predicates defined
within the current set of activated protocols.

Evaluation of the Boolean expressions within a protocol
demux statement determines which protocol is activated
next. Within a demux statement, the first expression which
evaluates to true indicates that the associated protocol is to
be activated at a specified offset relative to the first byte of
the present protocol. The starting offset of the protocol to be
activated is specified using the keyword at. A default protocol
may be specified using the keyword default. The first
case of the demux to evaluate true indicates which protocol
is activated next. All others are ignored. The syntax for the
demux is as follows:

Diagram 6: Demux syntax sample

demux {
    boolean_exp { protocol_name at offset }
    default { protocol_name at offset }
}

Diagram 7 shows an example of the demux declaration.

Diagram 7: Sample protocol demux

demux {
    { length == 10 }            { proto_a at offset_a }
    { flags && predicate_x }    { proto_b at offset_b }
    default                     { proto_default at offset_default }
}

In the above example, protocol proto_a is "activated" at
offset offset_a if the expression length equals ten. Protocol
proto_b is activated at offset offset_b if flags is true,
predicate_x is true and length is not equal to 10. predicate_x
is a pre-defined Boolean expression. The default protocol is
proto_default, which is defined here so that packets not
matching the predefined criteria can be processed. The fields
and predicates in a protocol are accessed by specifying the
protocol and the field or predicate separated by the dot
operator. This hierarchical naming model facilitates easy
extension to new protocols. Consider the IP protocol
example shown below.

Diagram 8: Protocol Sample: IP

protocol ip {
    vers { (ip[0:1] & 0xf0) >> 4 }
    hlength { ip[0:1] & 0x0f }
    hlength_b { hlength << 2 }
    tos { ip[1:1] }
    length { ip[2:2] }
    id { ip[4:2] }
    flags { (ip[6:1] & 0xe0) >> 5 }
    fragoffset { ip[6:2] & 0x1fff }
    ttl { ip[8:1] }
    proto { ip[9:1] }
    chksum { ip[10:2] }
    src { ip[12:4] }
    dst { ip[16:4] }
    intrinsic chksumvalid {}
    intrinsic genchksum {}
    predicate bcast { dst == 255.255.255.255 }
    predicate mcast { (dst & 0xf0000000) == 0xe0000000 }
    predicate frag { fragoffset != 0 || (flags & 2) != 0 }
    demux {
        ( proto == 6 ) { tcp at hlength_b }
        ( proto == 17 ) { udp at hlength_b }
        ( proto == 1 ) { icmp at hlength_b }
        ( proto == 2 ) { igmp at hlength_b }
        default { unknowIP at hlength_b }
    }
}

Here, ip is the protocol name being defined. The protocol
definition includes a number of fields which correspond to
portions of the IP header comprising one or more bytes. The
fields vers, hlength, flags and fragoffset have special operations
that extract certain bits from the IP header. hlength_b
holds the length of the header in bytes computed using the
hlength field (which is in units of 32-bit words).

bcast, mcast, and frag are predicates which may be useful
in defining other rules or predicates. Predicates are defined
in Section 7.

This protocol demuxes into four other protocols, excluding
the default, under different conditions. In this example,
the demultiplexing key is the protocol type specified by the
value of the IP proto field. All the protocols are activated at
offset hlength_b relative to the start of the IP header.

When a protocol is activated due to the processing of a
lower-layer demux statement, the activated protocol's name
becomes a Boolean that evaluates true (it is otherwise false).
Thus, if the IP protocol is activated, the expression ip will
evaluate to a true Boolean expression. The fields and predicates
in a protocol are accessed by specifying the protocol
and the field, predicate or intrinsic separated by the dot
operator. For example:

Diagram 9: Sample references

ip.length
ip.bcast
ip.chksumvalid

Users can provide additional declarations for new fields,
predicates and demux cases by extending previously-defined
protocol elements. Any name conflicts will be resolved by
using the newest definitions. This allows user-provided
definitions to override system-supplied definitions and
supports updates and migration. The syntax for extensions is
the protocol name followed by the new element separated by
the dot (.) operator. Following the name is the definition
enclosed in delimiters as illustrated below:

Diagram 10: Sample protocol extension

xx.newfield { xx[10:4] }
predicate xx.newpred { xx[8:2] != 10 }
xx.demux {
    ( xx[6:2] == 5 ) { newproto at 20 }
}



In the first example, a new field called newfield is 
declared for the protocol xx. In the second, a new predicate 
called newpred is defined for the protocol xx. In the third 
example, a new higher-layer protocol newproto is declared
as a demultiplexing target for the protocol xx. The root of the
protocol hierarchy is the reserved protocol frame, which 
refers to the received data from the link-layer. The redefi- 
nition of the protocol frame is not allowed for any protocol 
definitions, but new protocol demux operations can be added 
to it. 
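
The demux behavior described in Section 6.3, as used by the ip protocol of Diagram 8, can be sketched in Python as follows. This is a minimal model under stated assumptions: the names ip_fields, IP_DEMUX and demux are hypothetical, each case is represented as a (condition, protocol, offset field) tuple, and only the fields needed by the demux are extracted.

```python
from typing import Callable, Dict, List, Optional, Tuple

Fields = Dict[str, int]
# (condition, protocol to activate, name of the field holding the offset);
# a condition of None marks the default case.
DemuxCase = Tuple[Optional[Callable[[Fields], bool]], str, str]


def ip_fields(header: bytes) -> Fields:
    """Extract the IP header fields that the demux depends on."""
    hlength = header[0] & 0x0F            # header length in 32-bit words
    return {
        "hlength": hlength,
        "hlength_b": hlength << 2,        # header length in bytes
        "proto": header[9],
    }


IP_DEMUX: List[DemuxCase] = [
    (lambda f: f["proto"] == 6, "tcp", "hlength_b"),
    (lambda f: f["proto"] == 17, "udp", "hlength_b"),
    (lambda f: f["proto"] == 1, "icmp", "hlength_b"),
    (lambda f: f["proto"] == 2, "igmp", "hlength_b"),
    (None, "unknowIP", "hlength_b"),      # default case
]


def demux(fields: Fields, cases: List[DemuxCase]) -> Optional[Tuple[str, int]]:
    """Return (protocol, offset) for the first true case, else the default."""
    default = None
    for condition, proto, offset_field in cases:
        if condition is None:
            default = (proto, fields[offset_field])
        elif condition(fields):
            return proto, fields[offset_field]   # first match wins
    return default
```

The returned offset is relative to the first byte of the current protocol, so a caller would add it to the absolute start of the IP header to locate the activated protocol. Extending a protocol with a new demux case, as in Diagram 10, corresponds to adding a tuple to the case list.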

The intrinsics are listed in Table 28:

TABLE 28

List of intrinsics

Intrinsic Name     Functionality

ip.chksumvalid     Check the validity of the ip header checksum, return
                   boolean value
tcp.chksumvalid    Check the validity of the tcp pseudo checksum, return
                   boolean value
udp.chksumvalid    Check the validity of udp pseudo checksum, return
                   boolean value
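
For reference, the check performed by an intrinsic such as ip.chksumvalid can be sketched with the standard Internet checksum (a 16-bit one's complement sum over the header). The Python below is an illustration only; the function names are hypothetical, the helper fill_checksum is an assumption added for testing, and the TCP and UDP variants (which also cover a pseudo-header) are not shown.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of `data`."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                        # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ip_chksumvalid(header: bytes) -> bool:
    """True if the IPv4 header checksum verifies."""
    hlen = (header[0] & 0x0F) << 2            # header length in bytes
    if hlen < 20 or len(header) < hlen:
        return False
    # Summing a header that includes a correct checksum yields zero.
    return internet_checksum(header[:hlen]) == 0


def fill_checksum(header: bytes) -> bytes:
    """Return a copy of `header` with bytes 10-11 set to a valid checksum."""
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    hlen = (zeroed[0] & 0x0F) << 2
    value = internet_checksum(zeroed[:hlen])
    return zeroed[:10] + value.to_bytes(2, "big") + zeroed[12:]
```

A predicate such as ipValid in Diagram 16 would consume the Boolean result of this check alongside ordinary field comparisons.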



7. Predicates 

Predicates are named Boolean expressions that use pro- 
tocol header fields, other Boolean expressions, and 
previously-defined predicates as operands. The syntax for 
predicates is as follows: 

predicate predicate_name { boolean_expression }
For example,

predicate isTcpSyn { tcp && (tcp.flags & 0x02) != 0 }
predicate isNewTelnet { isTcpSyn && (tcp.dport == 23) }
In the second example, the predicate isTcpSyn is used in the 
expression to evaluate the predicate isNewTelnet. 
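
The two predicates above might be modeled in Python as plain Boolean functions over a per-packet view. The dictionary layout used here (a "tcp" key holding the activation flag, plus dotted keys for fields) is an assumption made for illustration and is not how the classifier stores packet state.

```python
def is_tcp_syn(pkt: dict) -> bool:
    # predicate isTcpSyn { tcp && (tcp.flags & 0x02) != 0 }
    # The protocol name acts as a Boolean that is true only when activated.
    return bool(pkt.get("tcp", False)) and (pkt["tcp.flags"] & 0x02) != 0


def is_new_telnet(pkt: dict) -> bool:
    # predicate isNewTelnet { isTcpSyn && (tcp.dport == 23) }
    return is_tcp_syn(pkt) and pkt["tcp.dport"] == 23


if __name__ == "__main__":
    syn_to_telnet = {"tcp": True, "tcp.flags": 0x02, "tcp.dport": 23}
    print(is_new_telnet(syn_to_telnet))
```

Placing the protocol name first means the field accesses are only attempted when the protocol has been activated, thanks to short-circuit evaluation.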

8. Sets 

The language supports the notion of sets and named
searches on sets, which can be used to efficiently check
whether a packet should be considered a member of some
application-defined equivalence class. Using sets, classification
rules requiring persistent state may be constructed.
The classification language only supports the evaluation of
set membership; modification of the contents of the sets is
handled exclusively by actions in conjunction with the ASL.
A named search defines a particular search on a set and its 
name may be used as a Boolean variable in subsequent 
Boolean expressions. Named searches are used to tie pre- 
computed lookup results calculated in the classification 
phase to actions executing in the action phase. 

A set is defined using the keyword set followed by an
identifier specifying the name of the set. The number of keys
for any search on the set is specified following the name,
between < and >. A set definition may optionally include a
hint as to the expected number of members of the set,
specified using the keyword size_hint. The syntax is as
follows:

Diagram 11: Declaring a set

set set_name < nkeys > {
    size_hint { expected_population }
}

The size_hint does not place a strict limit on the population
of the set, but as the set size grows beyond the hint value, the
search time may slowly increase.

Predicates and rules may perform named searches (see the
following section for a discussion of rules). Named searches
are specified using the keyword search followed by the
search name and search keys. The search name consists of
two parts: the name of the set to search, and the name of the
search being defined. The keys may refer to arbitrary
expressions, but typically refer to fields in protocols. The
number of keys defined in the named search must match the
number of keys defined for the set. The named search may
be used in subsequent predicates as a Boolean value, where
"true" indicates a record is present in the associated set with
the specified keys. An optional Boolean expression may be
included in a named search using the requires keyword. If
the Boolean expression fails to evaluate true, the search
result is always "false". The syntax for named searches is as
follows:

Diagram 12: Named search

search set_name.search_name (key1, key2) {
    requires { boolean_expression }
}

Consider the following example defining a set of transport-layer
protocol ports (tcp or udp):

Diagram 13: Sharing a set definition

#define MAX_TCP_UDP_PORTS_SET_SZ 200
/* TUPORTS: a set of TCP or UDP ports */
set tuports<1> {
    size_hint { MAX_TCP_UDP_PORTS_SET_SZ }
}
search tuports.tcp_sport (tcp.sport)
search tuports.tcp_dport (tcp.dport)
search tuports.udp_sport (udp.sport)
search tuports.udp_dport (udp.dport)

This example illustrates how one set may be used by
multiple searches. The set tuports might contain a collection
of port numbers of interest for either protocol, TCP/IP or
UDP/IP. The four named searches provide checks as to
whether different TCP or UDP source or destination port
numbers are present in the set. The results of named searches
may be used as Boolean values in expressions, as illustrated
below:

Diagram 14: Using shared sets

predicate tcp_sport_in { tuports.tcp_sport }
predicate tcp_port_in { tuports.tcp_sport && tuports.tcp_dport }
predicate udp_sdports_in {
    tuports.udp_sport || tuports.udp_dport
}



In the first example, a predicate tcp_sport_in is defined 
to be the Boolean result of the named search tuports.tcp_ 
sport, which determines whether or not the tcp.sport field 
(source port) of a TCP segment is in the set tuports. In the 
second example, both the source and destination ports of the 
TCP protocol header are searched using named searches. In 
the third case, membership of either the source or destination 
ports of a UDP datagram in the set is determined. 
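
The relationship between a set, its key count and its named searches can be sketched in Python as below. The class NclSet and the helper named_search are hypothetical names introduced here; population of the set stands in for what actions would do through the ASL, and the requires guards are an assumption added so that a search on an inactive protocol simply yields false.

```python
from typing import Callable, List, Optional


class NclSet:
    """A set whose members are tuples of exactly `nkeys` keys."""

    def __init__(self, nkeys: int, size_hint: Optional[int] = None):
        self.nkeys = nkeys
        self.size_hint = size_hint     # advisory only, never enforced
        self._members = set()

    def add(self, *keys) -> None:
        if len(keys) != self.nkeys:
            raise ValueError("wrong number of keys for this set")
        self._members.add(tuple(keys))

    def contains(self, *keys) -> bool:
        return tuple(keys) in self._members


def named_search(ncl_set: NclSet,
                 key_funcs: List[Callable[[dict], object]],
                 requires: Optional[Callable[[dict], bool]] = None):
    """Build a Boolean function of a packet, like `search set.name (keys)`."""
    if len(key_funcs) != ncl_set.nkeys:
        raise ValueError("number of search keys must match the set")

    def search(pkt: dict) -> bool:
        if requires is not None and not requires(pkt):
            return False               # failed requires: always false
        return ncl_set.contains(*(key(pkt) for key in key_funcs))

    return search


# One set shared by several searches, as in Diagram 13.
tuports = NclSet(1, size_hint=200)
tuports.add(23)
tcp_sport = named_search(tuports, [lambda p: p["tcp.sport"]],
                         requires=lambda p: p.get("tcp", False))
tcp_dport = named_search(tuports, [lambda p: p["tcp.dport"]],
                         requires=lambda p: p.get("tcp", False))
```

A predicate such as tcp_port_in then reduces to tcp_sport(pkt) and tcp_dport(pkt).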

9. Rules and Actions 

Rules are a named combination of a predicate and action. 
They are defined using the keyword rule. The predicate 
portion is a Boolean expression consisting of any combina- 
tion of individual Boolean subexpressions or other predicate 
names. The Boolean value of a predicate name corresponds 
to the Boolean value of its associated predicate portion. The 
action portion specifies the name of the action which is to be 
invoked when the predicate portion evaluates "true" for the 
current frame. Actions are implemented external to the 
classifier and supplied by application developers. Arguments 
can be specified for the action function and may include
predicates, named searches on sets, or results of intrinsic
functions. The following illustrates the syntax:



Diagram 15: Rule syntax

rule rule_name { predicate } {
    external_action_func {arg1, arg2, ...}
}



The argument list defines the values passed to the action 
code executed externally to NCL. An arbitrary number of 
arguments are supported. 
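
Rule evaluation as described in this section and in the overview of Section VII (predicates classified first, actions invoked in rule order, any action able to stop further actions) can be sketched as follows. The Rule class, the classify function and the convention that an action returns True to request a stop are assumptions of this sketch, not the NetBoost API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rule:
    name: str
    predicate: Callable[[dict], bool]
    action: Callable[..., bool]          # returns True to stop further actions
    args: List[Callable[[dict], object]] = field(default_factory=list)


def classify(pkt: dict, rules: List[Rule]) -> List[str]:
    """Run the rules against one packet; return the names of invoked rules."""
    # Classification phase: evaluate every predicate portion.
    matched = [rule for rule in rules if rule.predicate(pkt)]
    # Action phase: invoke actions in the order the rules were specified.
    invoked = []
    for rule in matched:
        values = [arg(pkt) for arg in rule.args]
        invoked.append(rule.name)
        if rule.action(*values):
            break                        # no further actions are invoked
    return invoked
```

Separating the two phases mirrors the text: more than one rule may be true for a packet, and the argument values handed to each action are computed from the packet rather than by the action itself.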



Diagram 16: Telnet/FTP example

set set_ip_tcp_ports <3> {
    size_hint { 100 }
}
set set_ip_udp_ports <3> {
    size_hint { 100 }
}
search set_ip_tcp_ports.tcp_dport ( ip.src, ip.dst, tcp.dport ) {
    requires (ip && tcp)
}
search set_ip_udp_ports.udp_dport ( ip.src, ip.dst, udp.dport ) {
    requires (ip && udp)
}
predicate ipValid { ip && ip.chksumvalid && (ip.hlen > 5) &&
                    (ip.ver == 4) }
predicate newtelnet { (tcp.flags & 0x02) && (tcp.dport == 23) }
predicate tftp { (udp.dport == 21) && set_ip_udp_ports.udp_dport }
rule telnetNewCon { ipValid && newtelnet && set_ip_tcp_ports.tcp_dport }
    { start_telnet ( set_ip_tcp_ports.tcp_dport ) }
rule tftppkt { ipValid && tftp }
    { is_tftp_pkt ( udp.dport ) }
rule addnewtelnet { newtelnet }
    { add_to_tcp_pkt_count ( ) }



In the above example, two sets are defined. One contains
source and destination IP addresses, plus TCP ports. The
other set contains IP addresses and UDP ports. Two named
searches are defined. The first search uses the IP source and
destination addresses and the TCP destination port number
as keys. The second search uses the IP source and destination
addresses and UDP destination port as keys. The
predicate ipValid checks to make sure the packet is an IP
packet with valid checksum, has a header of acceptable size,
and is IP version 4. The predicate newtelnet determines if the
current TCP segment is a SYN packet destined for a telnet
port. The predicate tftp determines if the UDP destination
port corresponds to the TFTP port number and the combination
of IP source and destination addresses and destination
UDP port number is in the set set_ip_udp_ports. The rule
telnetNewCon determines if the current segment is a new
telnet connection, and specifies that the associated external
function start_telnet will be invoked when this rule is true.
The function takes the search result as argument. The rule
tftppkt checks whether the packet belongs to a TFTP association.
If so, the associated action is_tftp_pkt will be
invoked with udp.dport as the argument. The third rule,
addnewtelnet, checks if the current segment is a new telnet
connection and defines the associated action function
add_to_tcp_pkt_count.

10. With Clauses 

A with clause is a special directive providing for condi- 
tional execution of a group of rules or predicates. The syntax 
is as follows: 



Diagram 17: With clause syntax sample

with boolean_expression {
    predicate pred_name { any_boolean_exp }
    rule rule_name { any_boolean_exp } { action_reference }
}



If the Boolean expression in the with clause evaluates
false, all the enclosed predicates and rules evaluate false. For
example, if we want to evaluate the validity of an IP
datagram and use it in a set of predicates and rules, these can
be encapsulated using the with clause and a conditional,
which could be the checksum of the IP header. Nested with
clauses are allowed, as illustrated in the following example:



Diagram 18: Nested with clauses

predicate tcpValid { tcp && tcp.chksumvalid }
#define TELNET 23 // port number for telnet

with ipValid {
    predicate tftp { (udp.dport == 21) &&
                     ip_udp_ports.udp_dport }
    with tcpValid { /* Nested with */
        predicate newtelnet { (tcp.flags & 0x02) &&
                              tcp.dport == TELNET }
        rule telnetNewCon { newtelnet && ip_tcp_ports.tcp_dport }
            { start_telnet { ip_tcp_ports.tcp_dport } }
    }
    rule tftppkt { tftp && ip_udp_ports.udp_dport }
        { is_tftp_pkt { udp.dport } }
}
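
The effect of a with clause, including nesting, can be modeled as a guard that is combined with every enclosed predicate. The Python sketch below is one possible reading under stated assumptions: the helper with_clause and the dictionary-based packet view are hypothetical, and the set search from the diagram is omitted for brevity.

```python
from typing import Callable

Predicate = Callable[[dict], bool]


def with_clause(guard: Predicate, inner: Predicate) -> Predicate:
    """Return `inner` as it behaves when enclosed in `with guard { ... }`."""
    def guarded(pkt: dict) -> bool:
        # If the guard is false, the enclosed predicate evaluates false
        # and `inner` is never consulted.
        return bool(guard(pkt)) and bool(inner(pkt))
    return guarded


TELNET = 23


def ip_valid(pkt: dict) -> bool:
    return bool(pkt.get("ip", False) and pkt.get("ip.chksumvalid", False))


def tcp_valid(pkt: dict) -> bool:
    return bool(pkt.get("tcp", False) and pkt.get("tcp.chksumvalid", False))


def newtelnet_body(pkt: dict) -> bool:
    return (pkt["tcp.flags"] & 0x02) != 0 and pkt["tcp.dport"] == TELNET


# Nested clauses compose: with ipValid { with tcpValid { newtelnet } }
newtelnet = with_clause(ip_valid, with_clause(tcp_valid, newtelnet_body))
```

Because the guard short-circuits, the TCP fields are only read for packets that passed both validity checks.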

11. Protocol Definitions for TCP/IP

The following NCL definitions are used for processing of 
TCP/IP and related protocols. 
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/•*•*•**••******,»» ****FRAME (bast unit)****"****** 

protocol frame { 

// status words written by NetBoost Ethernet MACs 

rxstatus { frame[Qxl80:4] } // receive status 

rxetalnp { frame[0xl84:4] } // receive time stamp 

txstatus { frame[0xl88:4] } // xmit status (if sent out) 



txstamp 

predicate rxerror 
length 
source 
offset 

predicate txok 
demux { 
rxerror 



{ frame[Qxl8C:4] } // xmit time stamp (if sent) 
{ (rxstatus & 0x80000000) } 
{ (rxstatus & OxO7FFOOO0) » 16 } // frame len 
{ (rxstatus & OxOOOOOFOO) » 8 } // hardware origin 
{ (rxstatus & OxOOOOOOFF) } // start of frame 
{ (txstatus & 0x80000000) !« 0 } // tx success 



} 



} 



{ framc_bad at 0 } 
// source 0: NetBoost onboard MAC A ethernet packet 
// source 1: NetBoost onboard MAC B ethernet packet 
// source 2: Other rxstatus- encodable ethernet packet 
(source < 3) { ether at 0x180 + offset } 
default { frame_bad at 0 } 



protocol frame_bad { 
} 

^««*«« ************** **»****** E t'hernET* 
#define ETHER_IPTYPE   0x0800
#define ETHER_ARPTYPE  0x0806
#define ETHER_RARPTYPE 0x8035

protocol ether {
    dst     { ether[0:6] }    // destination ethernet address
    src     { ether[6:6] }    // source ethernet address
    typelen { ether[12:2] }   // length or type, depends on encap
    snap    { ether[14:6] }   // SNAP code if present
    type    { ether[20:2] }   // type for 802.3 encaps

    // We are only interested in a specific subset of the possible
    // 802.3 encapsulations; specifically, those where the 802.2 LLC area
    // contains DSAP=0xAA, SSAP=0xAA, and CNTL=0x03, followed by
    // an 802.2 SNAP area that contains the ORG code 0x000000. In this
    // case, the 802.2 SNAP "type" field contains one of our ETHER
    // type values defined above.

    predicate issnap { (typelen <= 1500) && (snap == 0xAAAA03000000) }
    offset { 14 + (issnap << 3) }

    demux {
        typelen == ETHER_ARPTYPE  { arp at offset }
        typelen == ETHER_RARPTYPE { arp at offset }
        typelen == ETHER_IPTYPE   { ip at offset }
        issnap && (type == ETHER_ARPTYPE)  { arp at offset }
        issnap && (type == ETHER_RARPTYPE) { arp at offset }
        issnap && (type == ETHER_IPTYPE)   { ip at offset }
        default { ether_bad at 0 }
    }
}

protocol ether_bad {
}

/******************** ARP PROTOCOL ********************/

#define ARPHRD_ETHER      1   /* ethernet hardware format */
#define ARPHRD_FRELAY    15   /* frame relay hardware format */
#define ARPOP_REQUEST     1   /* request to resolve address */
#define ARPOP_REPLY       2   /* response to previous request */
#define ARPOP_REVREQUEST  3   /* request protocol address given hardware */
#define ARPOP_REVREPLY    4   /* response giving protocol address */
#define ARPOP_INVREQUEST  8   /* request to identify peer */
#define ARPOP_INVREPLY    9   /* response identifying peer */

protocol arp {
    htype  { arp[0:2] }
    ptype  { arp[2:2] }
    hsize  { arp[4:1] }
    psize  { arp[5:1] }
    op     { arp[6:2] }
    varhdr { 8 }

    predicate ethip4 { (op <= ARPOP_REVREPLY) && (htype == ARPHRD_ETHER) &&
                       (ptype == ETHER_IPTYPE) && (hsize == 6) && (psize == 4) }

    demux {
        ethip4  { ether_ip4_arp at varhdr }
        default { unimpl_arp at 0 }
    }
}

protocol unimpl_arp {
}

protocol ether_ip4_arp {
    shaddr { ether_ip4_arp[0:6] }
    spaddr { ether_ip4_arp[6:4] }
    thaddr { ether_ip4_arp[10:6] }
    tpaddr { ether_ip4_arp[16:4] }
}

/******************** IPv4 ********************/

protocol ip {
    verhl   { ip[0:1] }
    ver     { (verhl & 0xf0) >> 4 }
    hl      { (verhl & 0x0f) }
    hlen    { hl << 2 }
    tos     { ip[1:1] }
    length  { ip[2:2] }
    id      { ip[4:2] }
    ffo     { ip[6:2] }
    flags   { (ffo & 0xe000) >> 13 }
    fragoff { (ffo & 0x1fff) }
    ttl     { ip[8:1] }
    proto   { ip[9:1] }
    cksum   { ip[10:2] }
    src     { ip[12:4] }
    dst     { ip[16:4] }

    // variable length options start at offset 20

    predicate dbcast  { dst == 255.255.255.255 }
    predicate sbcast  { src == 255.255.255.255 }
    predicate smcast  { (src & 0xF0000000) == 0xE0000000 }
    predicate dmcast  { (dst & 0xF0000000) == 0xE0000000 }
    predicate dontfr  { (flags & 2) != 0 }   // "do not fragment this packet"
    predicate morefr  { (flags & 1) != 0 }   // "not last frag in datagram"
    predicate isfrag  { morefr || fragoff }
    predicate options { hlen > 20 }
    intrinsic chksumvalid { }
    predicate okwlen  { (frame.length - ether.offset) >= length }
    predicate invalid { (ver != 4) || (hlen < 20) ||
                        ((frame.length - ether.offset) < length) ||
                        (length < hlen) || !chksumvalid }
    predicate badsrc  { sbcast || smcast }

    demux {
        // Demux expressions are evaluated in order, and the
        // first one that matches causes a demux to the protocol;
        // once one matches, no further checks are made, so the
        // cases do not have to be precisely mutually exclusive.
        invalid       { ip_bad at 0 }
        badsrc        { ip_badsrc at 0 }
        (proto == 1)  { icmp at hlen }
        (proto == 2)  { igmp at hlen }
        (proto == 6)  { tcp at hlen }
        (proto == 17) { udp at hlen }
        default       { ip_unknown_transport at hlen }
    }
}

protocol ip_bad {
}

protocol ip_badsrc {
}

protocol ip_unknown_transport {
}

/******************** UDP ********************/

protocol udp {
    sport  { udp[0:2] }
    dport  { udp[2:2] }
    length { udp[4:2] }
    cksum  { udp[6:2] }
    intrinsic chksumvalid { }   /* undefined if a frag */
    predicate valid { ip.isfrag || chksumvalid }
}

/******************** TCP ********************/


protocol tcp {
    sport { tcp[0:2] }
    dport { tcp[2:2] }
    seq   { tcp[4:4] }
    ack   { tcp[8:4] }
    hlf   { tcp[12:2] }
    hl    { (hlf & 0xf000) >> 12 }
    hlen  { hl << 2 }
    flags { (hlf & 0x003f) }
    win   { tcp[14:2] }
    cksum { tcp[16:2] }
    urp   { tcp[18:2] }
    intrinsic chksumvalid { }   /* undefined if IP Fragment */
    predicate valid { ip.isfrag || ((hlen >= 20) && chksumvalid) }
    predicate opt_present { hlen > 20 }
}

/******************** ICMP ********************/


protocol icmp {
    type  { icmp[0:1] }
    code  { icmp[1:1] }
    cksum { icmp[2:2] }
}

/******************** IGMP ********************/

protocol igmp {
    vertype  { igmp[0:1] }
    ver      { (vertype & 0xf0) >> 4 }
    type     { (vertype & 0x0f) }
    reserved { igmp[1:1] }
    cksum    { igmp[2:2] }
    group    { igmp[4:4] }
}
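
The field expressions in the definitions above are shift-and-mask arithmetic over big-endian byte ranges. As a minimal sketch in C++ (not NCL, and not NetBoost code), the following mirrors the ether offset computation and the ip ver, hlen, flags, and fragoff fields. The names read_be, ether_payload_offset, Ip4Fields, and decode_ip4 are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch_hdr {

// Read n bytes (n <= 8) at byte offset off as a big-endian integer,
// mirroring the NCL field notation proto[off:n].
inline std::uint64_t read_be(const std::uint8_t* p, std::size_t off, std::size_t n) {
    assert(n <= 8);
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) v = (v << 8) | p[off + i];
    return v;
}

// Mirrors: predicate issnap { ... } and offset { 14 + (issnap << 3) }.
inline std::size_t ether_payload_offset(const std::uint8_t* ether) {
    const std::uint64_t typelen = read_be(ether, 12, 2);
    const std::uint64_t snap = read_be(ether, 14, 6);
    const bool issnap = (typelen <= 1500) && (snap == 0xAAAA03000000ULL);
    return 14 + (static_cast<std::size_t>(issnap) << 3);
}

struct Ip4Fields {
    unsigned ver;      // IP version
    unsigned hlen;     // header length in bytes
    unsigned flags;    // 3-bit flags field
    unsigned fragoff;  // 13-bit fragment offset
};

// Mirrors the ver, hl/hlen, flags, and fragoff expressions of protocol ip.
inline Ip4Fields decode_ip4(const std::uint8_t* ip) {
    const unsigned verhl = static_cast<unsigned>(read_be(ip, 0, 1));
    const unsigned ffo = static_cast<unsigned>(read_be(ip, 6, 2));
    Ip4Fields f;
    f.ver = (verhl & 0xf0) >> 4;
    f.hlen = (verhl & 0x0f) << 2;
    f.flags = (ffo & 0xe000) >> 13;
    f.fragoff = ffo & 0x1fff;
    return f;
}

}  // namespace sketch_hdr
```

With a plain Ethernet II header the payload offset is 14; when the SNAP predicate holds, the eight LLC/SNAP bytes push it to 22.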



VIII. ASL 

The Application Services Library (ASL) provides a set of
library functions available to action code that are useful for
packet processing. The complete environment available to
action code includes: the ASL; a restricted C/C++ library
and runtime environment; and one or more domain-specific
extensions such as TCP/IP.

The Restricted C/C++ Libraries and Runtime Environ- 
ment 

Action code may be implemented in either the ANSI C or 
C++ programming languages. A library supporting most of 
the functions defined in the ANSI C and C++ libraries is 
provided. These libraries are customized for the NetBoost 
PE hardware environment, and as such differ slightly from 
their equivalents in a standard host operating system. Most 
notably, file operations are restricted to the standard error 
and output streams (which are mapped into upcalls). 

In addition to the C and C++ libraries available to action 
code, NetBoost supplies a specialized C and C++ runtime 
initialization object module which sets up the C and C++ 
run-time environments by initializing the set of environment 
variables and, in the case of C++, executing constructors for 
static objects. 

1. ASL Functions

The ASL contains class definitions of potential use to any 
action code executing in the PE. It includes memory 
allocation, management of API objects (ACEs, targets), 
upcall/downcall support, set manipulation, timers, and a 
namespace support facility. The components comprising the 
ASL library are as follows: 

Basic Scalar Types 

The library contains basic type definitions that include the
number of bits represented. These include int8 (8 bit
integers), int16 (16 bit integers), int32 (32 bit integers), and
int64 (64 bit integers). In addition, unsigned values (uint8,
uint16, uint32, uint64) are also supported.

Special Endian-Sensitive Scalar Types 

The ASL is commonly used for manipulating the contents
of packets which are generally in network byte order. The
ASL provides type definitions similar to the basic scalar
types, but which represent data in network byte order. Types
in network byte order are declared in the same fashion as the
basic scalar types but with a leading n prefix (e.g. nuint16
refers to an unsigned 16 bit quantity in network byte order).
The following functions are used to convert between the
basic types (host order) and the network order types:

uint32  ntohl(nuint32 n);   // network to host (32 bit)
uint16  ntohs(nuint16 n);   // network to host (16 bit)
nuint32 htonl(uint32 h);    // host to network (32 bit)
nuint16 htons(uint16 h);    // host to network (16 bit)
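
One way to read the distinction between host-order and network-order types is as distinct C++ types, so that the compiler rejects accidental mixing. The sketch below is an assumption about how such types could be built, not the ASL's actual definitions; the names to_host16, to_net16, to_host32, and to_net32 are hypothetical stand-ins for ntohs, htons, ntohl, and htonl, chosen to avoid clashing with system macros.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch_net {

// Network-order values are stored as raw bytes, most significant byte first,
// exactly as they appear on the wire. Being structs, they cannot be used in
// host arithmetic without an explicit conversion.
struct nuint16 { std::uint8_t raw[2]; };
struct nuint32 { std::uint8_t raw[4]; };

inline std::uint16_t to_host16(nuint16 n) {
    return static_cast<std::uint16_t>((n.raw[0] << 8) | n.raw[1]);
}

inline nuint16 to_net16(std::uint16_t h) {
    nuint16 n;
    n.raw[0] = static_cast<std::uint8_t>(h >> 8);
    n.raw[1] = static_cast<std::uint8_t>(h & 0xff);
    return n;
}

inline std::uint32_t to_host32(nuint32 n) {
    return (static_cast<std::uint32_t>(n.raw[0]) << 24) |
           (static_cast<std::uint32_t>(n.raw[1]) << 16) |
           (static_cast<std::uint32_t>(n.raw[2]) << 8) |
            static_cast<std::uint32_t>(n.raw[3]);
}

inline nuint32 to_net32(std::uint32_t h) {
    nuint32 n;
    n.raw[0] = static_cast<std::uint8_t>(h >> 24);
    n.raw[1] = static_cast<std::uint8_t>((h >> 16) & 0xff);
    n.raw[2] = static_cast<std::uint8_t>((h >> 8) & 0xff);
    n.raw[3] = static_cast<std::uint8_t>(h & 0xff);
    return n;
}

// Round-trip sanity check that works on hosts of either endianness.
inline void self_check() {
    assert(to_host16(to_net16(0x1234)) == 0x1234);
}

}  // namespace sketch_net
```

Because the conversions work byte by byte, the same code is correct on both little-endian and big-endian hosts.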

Macros and Classes for Handling Errors and Exceptions 
in the ASL 

The ASL contains a number of C/C++ macro definitions
used to aid in debugging and code development (and mark
fatal error conditions). These are listed below:

ASSERT Macros (asserts boolean expression, halts on
failure)

CHECK Macros (asserts boolean, returns from current
real-time loop on failure)

STUB Macros (gives message, C++ file name and line
number)

SHO Macros (used to monitor value of a variable/
expression during execution)

Exceptions 

The ASL contains a number of functions available for use
as exception handlers. Exceptions are a programming
construct used to deliver error information up the call stack.
The following functions are provided for handling
exceptions:

NBaction_err and NBaction_warn functions, to be
invoked when exceptions are thrown.

OnError class, used to invoke functions during exception
handling, mostly for debugger breakpoints.

ACE support 

Ace objects in the ASL contain the per-Ace state
information. To facilitate common operations, pass and drop
targets are provided by the base Ace class and built when an
Ace instance is constructed. If no write action is taken on a
buffer that arrives at the Ace (i.e. none of the actions of
matching rules indicates it took ownership), the buffer is sent
to the pass target. The pass and drop functions (i.e. target
take functions, below) may be used directly as actions within
the NCL application description, or they may be called by
other actions. Member functions of the Ace class include:
pass( ), drop( ), enaRule( ) (enable a rule), and disRule( )
(disable a rule).

Action Support

The init_actions( ) call is the primary entry point into the
application's Action code. It is used by the ASL startup code
to initialize the PE portion of the Network Application. It is
responsible for constructing an Ace object of the proper
class, and typically does nothing else. Example syntax:

INITF init_actions (void* id, char* name, Image* obj)
{
    return new ExampleAce (id, name, obj);
}

The function should return a pointer to an object subclassed
from the Ace class, or a NULL pointer if an Ace could not
be constructed. Throwing an NBaction_err or NBaction_warn
exception may also be appropriate and will be caught by the
initialization code. Error conditions will be reported back to
the Resolver as a failure to create the Ace.

Return Values from Action Code/Handlers

When a rule's action portion is invoked because the rule's
predicate portion evaluated true, the action function must
return a code indicating how processing should proceed. The
action may return a code indicating it has disposed of the
frame (ending the classification phase), or it may indicate it
did not dispose of the frame, and further classification (rule
evaluations) should continue. A final option available is for
the action to return a defer code, indicating that it wishes to
modify a frame, but that the frame is in use elsewhere. The
return values are defined as C/C++ pre-processor definitions:

#define RULE_DONE

Actions should return RULE_DONE to terminate
processing of rules and actions within the context of the
current Ace; for instance when a buffer has been sent
to a target, or stored for later processing.

#define RULE_CONT

Actions should return RULE_CONT if they have merely
observed the buffer and wish for additional rules and
actions within the context of the current Ace to be
processed.

#define RULE_DEFER

Actions should return RULE_DEFER if they wish to
modify a packet within a buffer but the buffer notes that
the packet is currently busy elsewhere.

Predefined Actions

The common cases of disposing of a frame by either
dropping it or sending it on to the next classification entity
for processing are supported by two helper functions available
to NCL code, which result in calling the functions Ace::pass( )
or Ace::drop( ) within the ASL:

action_pass (predefined action), passes frame to 'pass
target', always returns RULE_DONE

action_drop (predefined action), passes frame to 'drop
target', always returns RULE_DONE

User-Defined Actions

Most often, user-defined actions are used in an Ace. Such
actions are implemented with the following calling
structure.

The ACTNF return type is used to set up linkage. Action
handlers take two arguments: a pointer to the current buffer
being processed, and the Ace associated with this action.
Example:

ACTNF do_mcast (Buffer *buf, ExAce *ace) {
    ace->mcast_ct++;
    cout << ace->name() << ": " << ace->mcast_ct << endl;
    return ace->drop(buf);
}

Thus, the Buffer* and ExAce* types are passed to the
handler. In this case, ExAce is derived from the base Ace
class:

#include "NBaction/NBaction.h"

class ExAce : public Ace {
public:
    ExAce(ModuleId id, char *name, Image *obj)
        : Ace(id, name, obj), mcast_ct(0) { }
    int mcast_ct;
};

INITF init_actions(void *id, char *name, Image *obj) {
    return new ExAce(id, name, obj);
}

Buffer Management (Buffer class)

The basic unit of processing in the ASL is the Buffer. All
data received from the network is received in buffers, and all
data to be transmitted must be properly formatted into
buffers. Buffers are reference-counted. Contents are typed
(more specifically, the first header has a certain type [an
integer/enumerated type]). Member functions of the Buffer
class support common trimming operations (trim head, trim
tail) plus additions (prepend and append data). Buffers are
assigned a time stamp upon arrival and departure (if they are
transmitted). The member function rxTime( ) returns the
receipt time stamp of the frame contained in the buffer. The
txTime( ) function gives the transmission complete time stamp
of the buffer if the frame it contains has been transmitted.
Several additional member functions and operators are
supported: new( ): allocates buffer from pool structure (see
below); headerBase( ): location of first network header;
headerOffset( ): reference to byte offset from start of storage
to first network header; packetSize( ): number of bytes in
frame; headerType( ): type of first header;
packetPadHeadSize( ): free space before net packet;
packetPadTailSize( ): free space after net packet; prepend( ):
add data to beginning; append( ): add data to end;
trim_head( ): remove data from head; trim_tail( ): remove
data from end; {rx,tx}Time( ): see above; next( ): reference to
next buffer on chain; incref( ): bump reference count;
decref( ): decrement reference count; busy( ): indicates buffer
being processed; log( ): allows for adding info to the
'transaction log' of a buffer, which can indicate what has
processed it.

Targets

Target objects within an Ace indicate the next hardware or
software resource that will classify a buffer along a selected
path. Targets are bound to another Ace within the same
application, an Ace within a different application, or a
built-in resource such as decryption. Bindings for Targets are
set up by the plumber (see above). The class includes the
member function take( ) which sends a buffer to the next
downstream entity for classification.

Targets have an associated module and Ace (specified by
a "ModuleId" object and an Ace*). They also have a name
in the name space contained in the resolver, which associates
Aces to applications.
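
The interaction between rule evaluation, the three return codes, and the default pass target can be sketched as a small dispatch loop. This is an illustrative model under stated assumptions, not the ASL implementation: the enum values, the Rule structure, the classify( ) method, and the handling of RULE_DEFER (counted and abandoned here) are all hypothetical, since the source names the codes but does not give their values or the deferral mechanism.

```cpp
#include <cassert>
#include <functional>
#include <vector>

namespace sketch_rules {

// Hypothetical values; the source defines only the names.
enum RuleResult { RULE_CONT, RULE_DONE, RULE_DEFER };

struct Buffer {
    int refs = 1;                           // reference count
    bool busy() const { return refs > 1; }  // held by someone else as well
};

struct Rule {
    std::function<bool(const Buffer&)> predicate;  // classification test
    std::function<RuleResult(Buffer&)> action;     // action handler
};

struct Ace {
    std::vector<Rule> rules;
    int passed = 0;
    int dropped = 0;
    int deferred = 0;

    RuleResult pass(Buffer&) { ++passed; return RULE_DONE; }
    RuleResult drop(Buffer&) { ++dropped; return RULE_DONE; }

    // Evaluate rules in order. Stop when an action disposes of the buffer
    // (RULE_DONE) or defers (RULE_DEFER); continue on RULE_CONT.
    void classify(Buffer& buf) {
        assert(buf.refs > 0);
        for (Rule& r : rules) {
            if (!r.predicate(buf)) continue;
            const RuleResult rc = r.action(buf);
            if (rc == RULE_DONE) return;
            if (rc == RULE_DEFER) { ++deferred; return; }
        }
        pass(buf);  // no action took ownership: send to the pass target
    }
};

}  // namespace sketch_rules
```

An observing action that returns RULE_CONT leaves the buffer to later rules, and a buffer that no rule disposes of ends up at the pass target, matching the behaviour described for the base Ace class.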
Upcall

An upcall is a form of procedure call initiated in the PE
module and handled in the AP module. Upcalls provide
communication between the "inline" portion of an
application and its "slower path" executing in the host
environment. Within the ASL, the upcall facility sends
messages to the AP. Messages are defined below. The Upcall
class contains the member function call( ), which takes
objects of type Message* and sends them asynchronously to
the AP module.

DowncallHandler

A downcall is a form of procedure call initiated in the AP
module and handled in the PE module. Downcalls provide
communication in the opposite direction from upcalls. The
class contains the member function direct( ), which provides
a pointer to the member function of the Ace class that is to
be invoked when the associated downcall is requested in the
AP. The Ace member function pointed to takes a Message*
type as argument.

Message

Messages contain zero, one, or two blocks of message
data, which are independently constructed using the
MessageBlock constructors (below). Uninitialized blocks will
appear at the Upcall handler in the AP module as zero length
messages. Member functions of the Message class include:
msg1( ), msg2( ), len1( ), len2( ), which return the addresses
and lengths of the messages (if present). Other member
functions: clr1( ), clr2( ), and done( ), which acknowledge
receipt of a message and free resources.

MessageBlock 

The MessageBlock class is used to encapsulate a region of 
storage within the Policy Engine memory that will be used 
in a future Upcall Message. It also includes a method to be 
called when the service software has copied the data out of 
that storage and no longer needs it to be stable (and can 
allow it to be recycled). Constructor syntax is as follows: 

MessageBlock (char *msg, int len=0, DoneFp done=0); 
MessageBlock (Buffer *buf); 
MessageBlock (int len, int off=0); 
The first form specifies an existing data area to be used as the 
data source. If the completion callback function (DoneFp) is 
specified, it will be called when the data has been copied out 
of the source area. Otherwise, no callback is made and no 
special actions are taken after the data is copied out of the 
message block. If no length is specified, then the base 
pointer is assumed to point to a zero-terminated string; the 
length is calculated to include the null termination. The 
second form specifies a Buffer object; the data transferred is 
the data contained within the buffer, and the relative align- 
ment of the data within the 32-bit word is retained. The 
reference count on the buffer is incremented when the 
MessageBlock is created, and the callback function is set to 
decrement the reference count when the copy out is com- 
plete. This will have the effect of marking the packet as 
"busy" for any actions that check for busy buffers, as well 
as preventing the buffer from being recycled before the copy 
out is complete. The third form requests that MessageBlock 
handle dynamic allocation of a region of memory large 
enough to hold a message of a specified size. Optionally, a 
second parameter can be specified that gives the offset from 
the 32-bit word alignment boundary where the data should 
start. The data block will retain this relative byte offset 
throughout its transfer to the Application Processor. This
allows, for instance, allocating a 1514-byte data area with
a 2-byte offset, building an Ethernet frame within it, and
having any IP headers included in the packet land properly
aligned on 32-bit alignment boundaries.
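
The 2-byte offset in this example follows from simple arithmetic: a 14-byte Ethernet header leaves the IP header two bytes past a 32-bit boundary unless the frame itself starts two bytes past one. A minimal sketch of that calculation, assuming a plain 14-byte Ethernet header (no SNAP encapsulation); the function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch_align {

constexpr std::size_t kEtherHeaderLen = 14;  // dst (6) + src (6) + type (2)
constexpr std::size_t kWordAlign = 4;        // 32-bit word, in bytes

// Distance of the IP header past the previous 32-bit boundary when the
// frame starts `offset` bytes past such a boundary. Zero means aligned.
constexpr std::size_t ip_header_misalignment(std::size_t offset) {
    return (offset + kEtherHeaderLen) % kWordAlign;
}

// Smallest start offset in [0, 4) that puts the IP header on a 32-bit boundary.
constexpr std::size_t offset_for_aligned_ip() {
    return (kWordAlign - kEtherHeaderLen % kWordAlign) % kWordAlign;
}

static_assert(offset_for_aligned_ip() == 2, "14-byte header needs a 2-byte start offset");

}  // namespace sketch_align
```

With an offset of 0 the IP header would sit 2 bytes past a word boundary; with an offset of 2 it starts at byte 16 and is aligned.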

Sets 

Sets are an efficient way to track a large number of
equivalence classes of packets, so that state can be kept for
all packets that have the same values in specific fields. For
instance, the programmer might wish to count the number of
packets that flow between any two specific IP address pairs,
or keep state for each TCP stream. Sets represent collections
of individual members, each one of which matches buffers
with a specific combination of field values. If the programmer
instead wishes to form sets of the form "the set of all
packets with IP header lengths greater than twenty bytes,"
then the present form of sets is not appropriate; instead, a
Classification Predicate should be used.

In NCL, the only information available regarding a set is
whether or not the set contained a record corresponding to a
vector of search keys. Within the ASL, all other set
operations are supported: searches, insertions, and removals.
For searches conducted in the CE, the ASL provides access to
additional information obtained during the search operation;
specifically, a pointer to the actual element located (for
successful searches), and other helpful information such as
an insertion pointer (on failure). The actual elements stored
in each set are of a class constructed by the compiler, or are
of a class that the software vendor has subclassed from that
class. The hardware environment places strict requirements
on the alignment modulus and alignment offset for each set
element.

As shown in the NCL specification, a single set may be
searched by several vectors of keys, resulting in multiple
search results that share the same target element records.
Each of these directives results in the construction of a
function that fills the key fields of the suitable Element
subclass from a buffer.

Within the ASL, the class Set is used to abstract a set. It
serves as a base class for compiler generated classes specific
to the sets specified in the NCL program (see below).

Search 

The Search class is the data type returned by all set
searching operations, whether provided directly by the ASL
or executed within the classification engine. Member
functions: ran( ): true if the CE executed this search on a set;
hit( ): true if the CE found a match using this search;
miss( ): inverse of hit( ), but can return a cookie making
inserts faster; toElement( ): converts a successful search
result to the underlying object; insert( ): inserts an object at
the place the miss( ) function indicates.
Element 

Contents of sets are called elements, and the NCL
compiler generates a collection of specialized classes derived
from the Element base class to contain user-specified data
within set elements. Set elements may have an associated
timeout value, indicating the maximum amount of time the
set element should be maintained. After the time out is
reached, the set element is automatically removed from the
set. The time out facility is useful for monitoring network
activity such as packet flows that should eventually be
cleared due to inactivity.
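
To make the set, search, and element-timeout ideas concrete, here is a software-only sketch that counts packets per (source, destination) IP pair and expires idle entries. It is an analogy built on std::map, not the CE-backed ASL classes: PairSet, Element, count( ), and expire( ) are hypothetical names, and the lower_bound iterator merely plays the role of the insertion pointer that a failed search returns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

namespace sketch_set {

using Key = std::pair<std::uint32_t, std::uint32_t>;  // (src IP, dst IP)

struct Element {
    std::uint64_t packets = 0;  // per-flow state kept by the application
    std::uint64_t expires = 0;  // absolute time at which the element times out
};

class PairSet {
public:
    explicit PairSet(std::uint64_t timeout) : timeout_(timeout) {
        assert(timeout > 0);
    }

    // Count one packet for (src, dst). On a miss, the position found by the
    // failed search is reused as the insertion hint.
    std::uint64_t count(std::uint32_t src, std::uint32_t dst, std::uint64_t now) {
        const Key k(src, dst);
        auto pos = elems_.lower_bound(k);
        if (pos == elems_.end() || pos->first != k)
            pos = elems_.emplace_hint(pos, k, Element());
        pos->second.expires = now + timeout_;  // activity refreshes the timeout
        return ++pos->second.packets;
    }

    // Remove every element whose timeout has been reached; returns how many.
    std::size_t expire(std::uint64_t now) {
        std::size_t removed = 0;
        for (auto it = elems_.begin(); it != elems_.end();) {
            if (it->second.expires <= now) { it = elems_.erase(it); ++removed; }
            else ++it;
        }
        return removed;
    }

    std::size_t size() const { return elems_.size(); }

private:
    std::uint64_t timeout_;
    std::map<Key, Element> elems_;
};

}  // namespace sketch_set
```

Note that (1, 2) and (2, 1) are distinct keys here; an application that wants direction-independent flows would normalise the pair before searching.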

Compiler-Generated Elt_<setname> Classes

For each set directive in the NCL program, the NCL
compiler produces an adjusted subclass of the Element class
called Elt_<setname>, substituting the name of the set for
<setname>. This class is used to define the type of elements
of the specified set. Because each set declaration contains
the number of keys needed to search the set, this
compiler-generated class is specialized from the Element base
class for the number of words of search key being used.

Compiler-Generated Set_<setname> Classes

For each set directive in the NCL program, the NCL
compiler produces an adjusted subclass of the Set class
called Set_<setname>, substituting the name of the set for
<setname>. This class is used to define the lookup functions
of the specified set. The NCL compiler uses the number of
words of key information to customize the parameter list for
the lookup function; the NCL size_hint is used to adjust a
protected field within the class. Aces that need to manipulate
sets should include an object of the customized Set class
as a member of their Ace.

Events

The Event class provides for execution of functions at
arbitrary times in the future, with efficient rescheduling of
the event and the ability to cancel an event without
destroying the event marker itself. A calendar queue is used to
implement the event mechanism. When constructing objects
of the Event class, two optional parameters may be
specified: the function to be called (which must be a member
function of a class based on Event), and an initial scheduled
time (how long in the future, expressed as a Time object).
When both parameters are specified, the event's service
function is set and the event is scheduled. If the Time
parameter is not specified, the Event's service function is
still set but the event is not scheduled. If the service function
is not set, it is assumed that the event will be directed to a
service function before it is scheduled in the future. Member
functions of this class include: direct( ): specifies what
function is to be executed at expiry; schedule( ): indicates
how far in the future the event is to trigger; cancel( ):
unschedules the event; curr( ): gets the time of the current
event.

Rate

The Rate class provides a simple way to track event rates
and bandwidths in order to watch for rates exceeding desired
values. The Rate constructor allows the application to
specify arbitrary sampling periods. The application can
(optionally) specify how finely to divide the sampling
period. Larger divisors result in more precise rate
measurement but require more overhead, since the Rate object
schedules Events for each of the shorter periods while there
are events within the longer period. Member functions of
this class include: clear( ): resets internal state; add( ):
bumps the event count; count( ): gives the best estimate of
the current trailing rate of events over the last/longer period.

Time

The Time class provides a common format for carrying
around a time value. Absolute, relative, and elapsed times
are handled identically. As conversions to and from int64 (a
sixty-four bit unsigned integer value) are provided, all scalar
operators are available for use; in addition, the assignment
operators are explicitly provided. Various other classes use
Time objects to specify absolute times and time intervals.
For maximum future flexibility in selection of storage
formats, the actual units of the scalar time value are not
specified; instead, they are stored as a class variable.
Extraction of meaningful data should be done via the
appropriate access methods rather than by direct arithmetic
on the Time object.

Class methods are available to construct Time objects for
specified numbers of standard time units (microseconds,
milliseconds, seconds, minutes, hours, days and weeks);
also, methods are provided for extraction of those standard
time periods from any Time object. Member functions
include: curr( ): returns the current real time; operators: +=,
-=, *=, /=, %=, <<=, >>=, |=, ^=, &=; accessors and builders:
usec( ), msec( ), secs( ), mins( ), hour( ), days( ), week( ),
which access or build Time objects using the specified number
of microseconds, milliseconds, seconds, minutes, hours, days,
and weeks, respectively.

Memory Pool

The Pool class provides a mechanism for fast allocation of
objects of fixed sizes at specified offsets from specified
power-of-two alignments, restocking the raw memory
resources from the PE module memory pool as required. The
constructor creates an object that describes the contents of
the memory pool and contains the configuration control
information for how future allocations will be handled.
Special 'offset' and 'restock' parameters are used. The
offset parameter allows allocation of classes where a specific
member needs to be strongly aligned; for example, objects
from the Buffer class contain an element called hard that
must start at the beginning of a 2048-byte-aligned region.
The restock parameter controls how much memory is
allocated from the surrounding environment when the pool is
empty. Enough memory is allocated to contain at least the
requested number of objects, of the specified size, at the
specified offset from the alignment modulus. Member
functions include: take( ): allocates a chunk; free( ): returns
a chunk to the pool.

Tagged Memory Pool

Objects that carry with them a reference back to the pool
from which they were taken are called tagged; this is most
useful for cases when the code that frees the object will not
necessarily know what pool it came from. This class is
similar to normal Memory Pools, except for internal details
and the calling sequence for freeing objects back into the
pool. The tag costs an additional word of overhead per
object, in exchange for the flexibility of being able to free
objects without knowing which Tagged pool they came from;
this is similar to most C library malloc implementations. If
the object has strong alignment requirements, the extra word
of overhead can cause much space to be wasted between the
objects. For instance, if the objects were required to start on
32-byte boundaries, the additional word would cause further
padding to be wasted between adjacent objects.

The Tagged class adds a second (static) version of the take
method, which is passed the size of the object to be
allocated. The Tagged class manages an appropriate set of
pools based on possible object sizes, grouping objects of
similar size together to limit the number of pools and allow
sharing of real memory between objects of slightly different
sizes. Member functions include: take( ): allocates a chunk;
free( ): returns a chunk to a pool.

Dynamic

This class takes care of overloading the new and delete
operators, redirecting the memory allocation to use a
number of Tagged Pools managed by the NBACTION DLL. All
classes derived from Dynamic share the same set of Tagged
Pools; each pool handles a specific range of object sizes, and
objects of similar sizes will share the same Tagged Pool. The
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Internet Checksum Support 



cksum 



Description 



10 



15 



dynamic class has no storage requirements and no virtual 
functions. Thus, declaring objects derived from-Dynamic 
will not change the size or layout of your objects (just how 
they are allocated). Operators defined include: new( 
)-allocate object from underlying pool, delete( )-return to 
underlying pool. 

Name Dictionary 

The Name class keeps a database of named objects (that 
are arbitrary pointers in the memory address space of the 
ASL. It provides mechanisms for adding objects to the 
dictionary, finding objects by name, and removing them 
from the dictionary. It is implemented with a Patricia Tree (a 
structure often used in longest prefix match in routing table 
lookups). Member functions include: find( )-lookup string, 
name( )-return name of dictionary. 

2. ASL Extensions for TCP/IP 

The TCP/IP Extensions to the Action Services Library 
(ASL) provides a set of class definitions designed to make 20 
several tasks common to TCP/IP -based network- oriented 
applications easier. With functions spanning several protocol 
layers, it includes operations such as IP fragment reassembly 
and TCP stream reconstruction. Note that many of the 
functions that handle Internet data make use of 1 6 and 32-bit 25 
data types beginning with 'n' (such as nuintl6 and nuint32). 
These data types refer to data in network byte order (i.e. big 
endian). Functions used to convert between host and net- 
work byte such as htonl( ) (which converts a 32-bit word 
from host to network byte order), are also defined. 

3. The Internet Class 

Functions of potential use to any Internet application are 
grouped together as methods of the Internet class. These 
functions are declared static within the class, so that they 
may be used easily without requiring an instatiation of the 
Internet class. 



40 



45 



The Internet Checksum is used extensively within the 
TCP/IP protocols to provide reasonably high assurance that 
data has been delivered correctly. In particular, it is used in 
IP (for headers), TCP and UDP (for headers and data), ICMP 
(for headers and data), and IGMP (for headers). 

The Internet checksum is defined to be the l's comple- 
ment of the sum of a region of data, where the sum is 
computed using 16-bit words and l's complement addition. 

Computation of this checksum is documented in a number 
of RFCs (available from ftp://ds.internic.net/rfc): RFC 1936 
describes a hardware implementation; RFC 1624 and RFC 
1141 describe incremental updates; RFC 1071 describes a 
number of mathematical properties of the checksum and 
how to compute it quickly. RFC 1071 also includes a copy 
of IEN 45 (from 1978), which describes motivations for the 
design of the checksum. 

The ASL provides the following functions to calculate 
Internet Checksums: 
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Syntax 



static nuint16 Internet::cksum(u_char* base, int len); 







Parameters 

Parameter   Type              Description 
base        unsigned char *   The starting address of the data. 
len         int               The number of bytes of data. 



Return Value 

Returns the Internet Checksum in the same byte order as 
the underlying data, which is assumed to be in network byte 
order (big endian). 

psum 
Description 

Computes the 2's-complement sum of a region of data 
taken as 16-bit words. The Internet Checksum for the 
specified data region may be generated by folding any carry 
bits above the low-order 16 bits and taking the 1's complement 
of the resulting value. 

Syntax 

static uint32 Internet::psum(u_char* base, int len); 







Parameters 

Parameter   Type              Description 
base        unsigned char *   The starting address of the data. 
len         int               The number of bytes of data. 



Computes the Internet Checksum of the data specified. 
This function works properly for data aligned to any byte 
boundary, but may perform (significantly) better for 32-bit 
aligned data. 



Return Value 

Returns the 2's-complement 32-bit sum of the data treated 
as an array of 16-bit words. 
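
The folding step mentioned in the psum description can be sketched as below. The function name fold_to_checksum is hypothetical and the code is a standalone illustration in standard C++, not part of the ASL.

```cpp
#include <cstdint>

// Hypothetical helper: folds a 32-bit partial sum into 16 bits by adding
// the carry bits back into the low-order word, then complements the result.
inline std::uint16_t fold_to_checksum(std::uint32_t partial) {
    while (partial >> 16)
        partial = (partial & 0xFFFFu) + (partial >> 16);
    return static_cast<std::uint16_t>(~partial & 0xFFFFu);
}
```

Keeping the unfolded 32-bit sum is convenient when several regions (for example a pseudo-header and a payload) are summed separately and combined before the final fold.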

incrcksum 

Description 

Computes a new Internet Checksum incrementally. That 
is, a new checksum is computed given the original checksum 
for a region of data, a checksum for a block of data to be 
replaced, and a checksum of the new data replacing the old 
data. This function is especially useful when small regions 
of packets are modified and checksums must be updated 
appropriately (e.g. for decrementing IP ttl fields or rewriting 
address fields for NAT). 

Syntax 



static uint16 Internet::incrcksum(nuint16 ocksum, nuint16 odsum, 
nuint16 ndsum); 
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Parameters 


Parameter   Type      Description 
ocksum      nuint16   The original checksum. 
odsum       nuint16   The checksum of the old data. 
ndsum       nuint16   The checksum of the new (replacing) data. 
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Syntax 

static uint16 apssum(IP4Header* hdr); 



Return Value 
Returns the computed checksum. 
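
As a sketch of the incremental update that incrcksum is described as performing, the code below applies equation 3 of RFC 1624, HC' = ~(~HC + ~m + m'), where m and m' are the 1's-complement sums of the old and new data. The name incr_checksum is hypothetical; the sketch assumes all three arguments are checksums (complemented sums) held as host-order numbers, so m = ~odsum and m' = ~ndsum. It is not the ASL implementation.

```cpp
#include <cstdint>

// Hypothetical helper following RFC 1624, eqn. 3.
// ocksum: checksum of the whole region before the change
// odsum:  checksum of the bytes being replaced
// ndsum:  checksum of the replacement bytes
inline std::uint16_t incr_checksum(std::uint16_t ocksum,
                                   std::uint16_t odsum,
                                   std::uint16_t ndsum) {
    std::uint32_t sum = static_cast<std::uint16_t>(~ocksum); // ~HC
    sum += odsum;                                            // ~m  (m = ~odsum)
    sum += static_cast<std::uint16_t>(~ndsum);               // m'  (m' = ~ndsum)
    while (sum >> 16)                                        // end-around carry
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return static_cast<std::uint16_t>(~sum & 0xFFFFu);
}
```

With the RFC 1624 example values (HC = 0xDD2F, m = 0x5555, m' = 0x3285) the arguments are 0xDD2F, 0xAAAA and 0xCD7A, and the result is 0x0000.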

asum 

Description 

The function asum computes the checksum over only the 
IP source and destination addresses. 

Syntax 

static uint16 asum(IP4Header* hdr); 

Parameters 

Parameter   Type          Description 
hdr         IP4Header *   Pointer to the header. 

Return Value 
Returns the checksum. 

apasum 
Description 

The function apasum behaves like apssum but covers the 
TCP ACK field instead of the sequence number field. 

Syntax 

static uint16 apasum(IP4Header* hdr); 

Parameters 

Parameter   Type          Description 
hdr         IP4Header *   Pointer to the header. 

Return Value 
Returns the checksum. 

apsum 
Description 

The function apsum behaves like asum but includes the 
address plus the two 16-bit words immediately following the 
IP header (which are the port numbers for TCP and UDP). 

Syntax 

static uint16 apsum(IP4Header* hdr); 

Parameters 

Parameter   Type          Description 
hdr         IP4Header *   Pointer to the header. 



Return Value 
Returns the checksum. 

apssum 
Description 

The function apssum behaves like apsum, but covers the 
IP addresses, ports, plus TCP sequence number. 



Return Value 
Returns the checksum. 

apsasum 
Description 

The function apsasum behaves like apasum but covers the 
IP addresses, ports, plus the TCP ACK and sequence num- 
bers. 

Syntax 

static uint16 apsasum(IP4Header* hdr); 

Parameters 

Parameter   Type          Description 
hdr         IP4Header *   Pointer to the header. 



Return Value 

Returns the checksum. 

4. IP Support 

This section describes the class definitions and constants 
used in processing IP-layer data. Generally, all data is stored 
in network byte order (big endian). Thus, care should be 
taken by the caller to ensure computations result in proper 
values when processing network byte ordered data on little 
endian machines (e.g. in the NetBoost software-only 
environment on PC-compatible architectures). 
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5. IP Addresses 
The IP4Addr class defines 32-bit IP version 4 addresses. 

Constructors 

Description 

The class IP4Addr is the abstraction of an IP (version 4) 
address within the ASL. It has two constructors, allowing for 
the creation of the IPv4 addresses given an unsigned 32-bit 
word in either host or network byte order. In addition, the 
class is derived from nuint32, so IP addresses may generally 
be treated as 32-bit integers in network byte order. 
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IP4Mask(unit32 mh); 
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Parameters 


Parameter   Type      Description 
mh          uint32    32-bit mask in host byte order 
mn          nuint32   32-bit mask in network byte order 

None. 



Syntax 

IP4Addr(nuint32 an); 
IP4Addr(uint32 ah); 

Parameters 

Parameter   Type      Description 
an          nuint32   Unsigned 32-bit word in network byte order. 
ah          uint32    Unsigned 32-bit word in host byte order. 

Return Value 

None. 

Example 

The following simple example illustrates the creation of 
addresses: 

#include "NBip.h" 

uint32 myhaddr = (128U << 24) | (32 << 16) | (12 << 8) | 4; 
nuint32 mynaddr = htonl((128U << 24) | (32 << 16) | (12 << 8) | 4); 
IP4Addr ip1(myhaddr);   // constructed from a host byte order word 
IP4Addr ip2(mynaddr);   // constructed from a network byte order word 

This example creates two IP4Addr objects, each of which 
refers to the IP address 128.32.12.4. Note the use of the htonl( ) 
ASL function to convert the host 32-bit word into network 
byte order. 

6. IP Masks 

Masks are often applied to IP addresses in order to 
determine network or subnet numbers, CIDR blocks, etc. 
The class IP4Mask is the ASL abstraction for a 32-bit mask, 
available to be applied to an IPv4 address (or for any other 
use). 

Constructor 
Description 

Instantiates the IP4Mask object with the mask specified. 

Syntax 

IP4Mask(nuint32 mn); 

leftcontig 
Description 

Returns true if all of the 1-bits in the mask are 
left-contiguous, and returns false otherwise. 

Syntax 

bool leftcontig( ); 

Parameters 

None. 

Return Value 

Returns true if all the 1-bits in the mask are left-contiguous. 

bits 
Description 

The function bits returns the number of left-contiguous 
1-bits in the mask (a form of "population count"). 

Syntax 

int bits( ); 

Parameters 

None. 

Return Value 

Returns the number of left-contiguous bits in the mask. 
Returns -1 if the 1-bits in the mask are not left-contiguous. 

Example 

#include "NBip.h" 

uint32 mymask = 0xffffff80; // 255.255.255.128 or /25 
IP4Mask ipm(mymask); 
int nbits = ipm.bits( ); 
if (nbits >= 0) { 
    sprintf(msgbuf, "Mask is of the form /%d", nbits); 
} else { 
    sprintf(msgbuf, "Mask is not left-contiguous!"); 
} 

This example creates a subnet mask with 25 bits, and sets up 
a message buffer containing a string which describes the 
form of the mask (using the common "slash notation" for 
subnet masks). 
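
The behavior described for bits( ) and leftcontig( ) can be approximated with the following standalone sketch. The function left_contiguous_bits is a hypothetical stand-in operating on a host byte order 32-bit value; how the ASL actually implements these members is not stated in this text.

```cpp
#include <cstdint>

// Hypothetical stand-in for the documented behavior of IP4Mask::bits():
// counts leading 1-bits and reports -1 if any 1-bit follows a 0-bit.
inline int left_contiguous_bits(std::uint32_t mask) {
    int n = 0;
    while (mask & 0x80000000u) {   // consume leading 1-bits
        ++n;
        mask <<= 1;
    }
    return mask == 0 ? n : -1;     // leftover 1-bits mean a gap
}
```

Under these assumptions 0xFFFFFF80 yields 25, while a mask such as 0xFF00FF00 yields -1.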
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1. IP Header 

The IP4Header class defines the standard IP header, where 
sub-byte sized fields have been merged in order to reduce 
byte-order dependencies. In addition to the standard IP 
header, the class includes a number of methods for convenience. 
The class contains no virtual functions, and therefore 
pointers to the IP4Header class may be used to point to IP 
headers received in live network packets. 

The class contains a number of member functions, some 
of which provide direct access to the header fields and others 
which provide computed values based on header fields. 
Members which return computed values are described 
individually; those functions which provide only simple access 
to fields are as follows: 



Function    Return Type   Description 

vhl( )      nuint8&       Returns a reference to the byte containing the 
                          IP version and header length 
tos( )      nuint8&       Returns a reference to the IP type of service 
                          byte 
len( )      nuint16&      Returns a reference to the IP datagram 
                          (fragment) length in bytes 
id( )       nuint16&      Returns a reference to the IP identification field 
                          (used for fragmentation) 
offset( )   nuint16&      Returns a reference to the word containing 
                          fragmentation flags and fragment offset 
ttl( )      nuint8&       Returns a reference to the IP time-to-live byte 
proto( )    nuint8&       Returns a reference to the IP protocol byte 
cksum( )    nuint16&      Returns a reference to the IP checksum 
src( )      IP4Addr&      Returns a reference to the IP source address 
dst( )      IP4Addr&      Returns a reference to the IP destination 
                          address 

The following member functions of the IP4Header class 
provide convenient methods for accessing various information 
about an IP header. 
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Syntax 



int hl( ); 
void hl(int h); 

Description 

The first form of the function hl returns the number of 32-bit 
words in the IP header. The second form modifies the header 
length field to be equal to the specified length. 

Parameters 

Parameter   Type   Description 
h           int    Specifies the header length (in 32-bit words) to assign 
                   to the IP header 

Return Value 

The first form of this function returns the number of 32-bit 
words in the IP header. 

hlen 
Description 

The function hlen returns the number of bytes in the IP 
header (including options). 

Syntax 

int hlen( ); 

Parameters 

None. 

Return Value 

Returns the number of bytes in the IP header including 
options. 

optbase 
Description 

Returns the location of the first IP option in the IP header 
(if present). 

Syntax 

unsigned char* optbase( ); 

Parameters 

None. 

Return Value 

Returns the address of the first option present in the 
header. If no options are present, it returns the address of the 
first byte of the payload. 

ver 
Description 

The first form of the function ver returns the version field 
of the IP header. The second form assigns the version number 
to the IP header. 

Syntax 

int ver( ); 
void ver(int v); 

Parameters 

Parameter   Type   Description 
v           int    Specifies the version number. 

Return Value 

The first form returns the version field of the IP header. 

payload 
Description 

The function payload returns the address of the first byte 
of data (beyond any options present). 
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Syntax 
unsigned char* payload( ); 

Parameters 

None. 

Return Value 

Returns the address of the first byte of payload data in the 
IP packet. 
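
The relationship between the version, the header length in 32-bit words, the header length in bytes, and the payload address can be sketched as follows. The helper names are hypothetical, and the sketch assumes the standard IPv4 layout in which the first header byte carries the version in its high four bits and the header length (in 32-bit words) in its low four bits.

```cpp
#include <cstdint>

// Hypothetical helpers operating on the first byte of an IPv4 header.
inline int ip_version(std::uint8_t first_byte)  { return first_byte >> 4; }
inline int ip_hl_words(std::uint8_t first_byte) { return first_byte & 0x0F; }
inline int ip_hlen_bytes(std::uint8_t first_byte) {
    return ip_hl_words(first_byte) * 4;      // 32-bit words to bytes
}

// The payload starts immediately after the header, options included.
inline const unsigned char* ip_payload(const unsigned char* packet) {
    return packet + ip_hlen_bytes(packet[0]);
}
```

A header without options has a first byte of 0x45 (version 4, five words), giving a 20-byte header; one 4-byte option raises this to 0x46 and 24 bytes.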

psum 

Description 

The function psum is used internally by the ASL library, 
but may be useful to some applications. It returns the 16-bit 
one's complement sum of the source and destination IP 
addresses plus 8-bit protocol field (in the low-order byte). It 
is useful in computing pseudo-header checksums for UDP 
and TCP. 

Syntax 

uint32 psum( ); 

Parameters 

None. 
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-continued - 





Return Value 

Returns the 16-bit one's complement sum of the source 
and destination IP addresses plus the 8-bit protocol field. 

Definitions 

In addition to the IP header itself, a number of definitions 
are provided for manipulating fields of the IP header with 
specific semantic meanings. 

Fragmentation 

Define        Value    Description 
IP_DF         0x4000   Don't fragment flag, RFC 791, p. 13. 
IP_MF         0x2000   More fragments flag, RFC 791, p. 13. 
IP_OFFMASK    0x1FFF   Mask for determining the fragment offset from 
                       the IP header offset( ) function. 

Limitations 

IP_MAXPACKET  65535    Maximum IP datagram size. 

IP Service Type 

The following table contains the definitions for IP type of 
service byte (not commonly used): 

Define              Value   Reference 
IPTOS_LOWDELAY      0x10    RFC 791, p. 12. 
IPTOS_THROUGHPUT    0x08    RFC 791, p. 12. 
IPTOS_RELIABILITY   0x04    RFC 791, p. 12. 
IPTOS_MINCOST       0x02    RFC 1349. 

IP Precedence 

The following table contains the definitions for IP precedence. 
All are from RFC 791, p. 12 (not widely used). 

Define                       Value 
IPTOS_PREC_NETCONTROL        0xE0 
IPTOS_PREC_INTERNETCONTROL   0xC0 
IPTOS_PREC_CRITIC_ECP        0xA0 
IPTOS_PREC_FLASHOVERRIDE     0x80 
IPTOS_PREC_FLASH             0x60 
IPTOS_PREC_IMMEDIATE         0x40 
IPTOS_PREC_PRIORITY          0x20 
IPTOS_PREC_ROUTINE           0x00 

Option Definitions 

The following table contains the definitions for supporting 
IP options. All definitions are from RFC 791, pp. 15-23. 

Define            Value        Description 
IPOPT_COPIED(o)   ((o)&0x80)   A macro which returns true if the 
                               option 'o' is to be copied upon 
                               fragmentation. 
IPOPT_CLASS(o)    ((o)&0x60)   A macro giving the option class for 
                               the option 'o' 
IPOPT_NUMBER(o)   ((o)&0x1F)   A macro giving the option number 
                               for the option 'o' 
IPOPT_CONTROL     0x00         Control class 
IPOPT_RESERVED1   0x20         Reserved 
IPOPT_DEBMEAS     0x40         Debugging and/or measurement 
                               class 
IPOPT_RESERVED2   0x60         Reserved 
IPOPT_EOL         0            End of option list. 
IPOPT_NOP         1            No operation. 
IPOPT_RR          7            Record packet route. 
IPOPT_TS          68           Time stamp. 
IPOPT_SECURITY    130          Provide s, c, h, tcc. 
IPOPT_LSRR        131          Loose source route. 
IPOPT_SATID       136          Satnet ID. 
IPOPT_SSRR        137          Strict source route. 
IPOPT_RA          148          Router alert. 

Options Field Offsets 

The following table contains the offsets to fields in 
options other than EOL and NOP. 

Define         Value   Description 
IPOPT_OPTVAL   0       Option ID. 
IPOPT_OLEN     1       Option length. 
IPOPT_OFFSET   2       Offset within option. 
IPOPT_MINOFF   4       Minimum value of offset. 
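
The fragmentation and option definitions above are bit masks, and a short sketch may make their use concrete. The constant and function names below are hypothetical mirrors of IP_DF, IP_MF, IP_OFFMASK and the IPOPT_ macros, written in standard C++ rather than with the ASL headers. The sketch assumes the offset word has already been converted to host byte order and relies on the RFC 791 rule that fragment offsets are counted in units of 8 bytes.

```cpp
#include <cstdint>

// Hypothetical mirrors of IP_DF, IP_MF and IP_OFFMASK.
constexpr std::uint16_t kIpDf = 0x4000;
constexpr std::uint16_t kIpMf = 0x2000;
constexpr std::uint16_t kIpOffMask = 0x1FFF;

struct FragInfo {
    bool dont_fragment;
    bool more_fragments;
    unsigned byte_offset;   // offset field scaled from 8-byte units to bytes
};

// Decodes the flags-and-offset word (host byte order).
inline FragInfo decode_offset_word(std::uint16_t off) {
    return { (off & kIpDf) != 0, (off & kIpMf) != 0,
             unsigned(off & kIpOffMask) * 8u };
}

// A packet is a whole datagram when it starts at offset zero and
// no further fragments follow.
inline bool is_complete_datagram(std::uint16_t off) {
    return (off & kIpOffMask) == 0 && (off & kIpMf) == 0;
}

// Mirrors of IPOPT_COPIED, IPOPT_CLASS and IPOPT_NUMBER.
inline bool opt_copied(std::uint8_t o) { return (o & 0x80) != 0; }
inline int  opt_class(std::uint8_t o)  { return o & 0x60; }
inline int  opt_number(std::uint8_t o) { return o & 0x1F; }
```

For example, option value 131 (loose source route) has the copied bit set, whereas option value 7 (record route) does not, so only the former would be carried into every fragment.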


7. Fragments and Datagrams 



The IP protocol performs adaptation of its datagram size 
by an operation known as fragmentation. Fragmentation 
allows for an initial (large) IP datagram to be broken into a 
sequence of IP fragments, each of which is treated as an 
independent packet until they are received and reassembled 
at the original datagram's ultimate destination. Conventional 
IP routers never reassemble fragments but instead route 
them independently, leaving the destination host to 
reassemble them. In some circumstances, however, applications 
running on the NetBoost platform may wish to reassemble 
fragments themselves (e.g. to simulate the operation of the 
destination host). 

8. IP Fragment class 

Within the ASL, a fragment represents a single IP packet 
(containing an IP header), which may or may not be a complete 
IP layer datagram. In addition, a datagram within the ASL 
represents a collection of fragments. A datagram (or 
fragment) is said to be complete if it represents or contains 
all the fragments necessary to represent an entire IP-layer 
datagram. 

The IP4Fragment class is defined as follows. 
Constructors 
Description 

The IP4Fragment class provides the abstraction of a 
single IP packet placed in an ASL buffer (see the description 
of the Buffer elsewhere in the chapter). It has two constructors 
intended for use by applications. 

The first of these allows for specifying the buffer containing 
an IP fragment as the parameter bp. The location 
of the IP header within the buffer is the 
second argument. This is the most commonly-used 
constructor when processing IP fragments in ACE 
action code. 

The second form of the constructor performs the same 
steps as the first form, but also allocates a new Buffer 
object and copies the IP header pointed to by protohdr into 
the new buffer (if specified). This form of the constructor 
is primarily intended for creation of IP fragments 
during IP datagram fragmentation. If the specified 
header contains IP options, only those options which 
are copied during fragmentation are copied. 
Syntax 

IP4Fragment(Buffer* bp, IP4Header* iph); 
IP4Fragment(int maxiplen, IP4Header* protohdr=0); 

Parameters 

Parameter   Type          Description 
bp          Buffer *      The starting address of the buffer containing 
                          the IP fragment 
maxiplen    int           The maximum size of the fragment being 
                          created; used to size the allocated Buffer. 
protohdr    IP4Header *   The IP4 header to copy into the buffer, if 
                          provided. If the header contains IP options, 
                          only those options normally copied during 
                          fragmentation are copied. 

Return Value 

None. 

Destructor 
Description 

Frees the fragment. 

Syntax 

~IP4Fragment( ); 

Parameters 

None. 

Return Value 

None. 

hdr 
Description 

The function hdr returns the address of the IP header of 
the fragment. 

Syntax 

IP4Header* hdr( ); 

Parameters 

None. 

Return Value 

Returns the address of the IP4Header class at the beginning 
of the fragment. 

payload 
Description 

The function payload returns the address of the first byte 
of data in the IP fragment (after the basic header and 
options). 

Syntax 

u_char* payload( ); 

Parameters 

None. 

Return Value 

Returns the address of the first byte of data in the IP 
fragment. 

buf 
Description 

The function buf returns the address of the Buffer struc- 
ture containing the IP fragment. 

Syntax 

Buffer* buf( ); 

Parameters 

None. 

Return Value 

Returns the address of the Buffer structure containing the 
IP fragment. This may return NULL if there is no buffer 
associated with the fragment. 

next 

Description 

Returns a reference to the pointer pointing to the next 
fragment of a doubly-linked list of fragments. This is used 
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to link together fragments when they are reassembled (in 
Datagrams), or queued, etc. Typically, fragments are linked 
together in a doubly-linked list fashion with NULL pointers 
indicating the list endpoints. 

Syntax 
IP4Fragment*& next( ); 

Parameters 

None. 

Return Value 

Returns a reference to the internal linked-list pointer. 

prev 
Description 

Like next, but returns a reference to pointer to the 
previous fragment on the list. 

Syntax 
IP4Fragment*& prev( ); 

Parameters 

None. 

Return Value 

Returns a reference to the internal linked-list pointer. 

first 
Description 

The function first returns true when the fragment repre- 
sents the first fragment of a datagram. 



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Syntax 

bool first( ); 

Parameters 

None. 

Return Value 

Returns true when the fragment represents the first 
fragment of a datagram. 

fragment 
Description 

Fragments an IP datagram comprising a single fragment. 
The fragment( ) function allocates Buffer structures to hold 
the newly-formed IP fragments and links them together. It 
returns the head of the doubly-linked list of fragments. Each 
fragment in the list will be limited in size to at most the 
specified MTU size. The original fragment is unaffected. 

Syntax 

IP4Datagram* fragment(int mtu); 

Parameters 

Parameter   Type   Description 
mtu         int    The maximum transmission unit (MTU) size limiting 
                   the maximum fragment size 

Return Value 

Returns a pointer to an IP4Datagram object containing a 
doubly-linked list of IP4Fragment objects. Each fragment 
object is contained within a Buffer class allocated by the 
ASL library. The original fragment object (the one 
fragmented) is not freed by this function. The caller must 
free the original fragment when it is no longer needed. 

complete 
Description 

The function complete returns true when the fragment 
represents a complete IP datagram. 

Syntax 

bool complete( ); 

Parameters 

None. 

Return Value 

Returns true when the fragment represents a complete IP 
datagram (that is, when the fragment offset field is zero and 
there are no additional fragments). 

optcopy 
Description 

The static method optcopy is used to copy options from 
one header to another during IP fragmentation. The function 
will only copy those options that are supposed to be copied 
during fragmentation (i.e. those options x for which the 
macro IPOPT_COPIED(x) is non-zero (true)). 

Syntax 

static int optcopy(IP4Header* src, IP4Header* dst); 

Parameters 

Parameter   Type          Description 
src         IP4Header *   Pointer to the source IP header containing 
                          options 
dst         IP4Header *   Pointer to the destination, where the source 
                          header should be copied to 

Return Value 

Returns the number of bytes of options present in the 
destination IP header. 
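
To illustrate how an MTU limits fragment sizes, as the fragment function above is described as doing, the following sketch computes only the offsets and lengths of the pieces. It is a simplified, hypothetical planner in standard C++: it allocates no Buffer objects, and it assumes every fragment carries a header of the same length hdr_len, which ignores the fact that options not marked for copying make later headers shorter. It also relies on the RFC 791 rule that every fragment except the last carries a multiple of 8 payload bytes.

```cpp
#include <cstddef>
#include <vector>

struct FragPlan {
    std::size_t offset;   // payload offset in bytes
    std::size_t length;   // payload bytes in this fragment
    bool more;            // true if further fragments follow
};

// Hypothetical planner: splits payload_len bytes so that each fragment,
// header included, fits within mtu.
inline std::vector<FragPlan> plan_fragments(std::size_t payload_len,
                                            std::size_t hdr_len,
                                            std::size_t mtu) {
    std::vector<FragPlan> plan;
    if (mtu < hdr_len + 8)
        return plan;                       // MTU too small to make progress
    const std::size_t chunk = ((mtu - hdr_len) / 8) * 8;
    for (std::size_t off = 0; off < payload_len; off += chunk) {
        const std::size_t left = payload_len - off;
        const std::size_t n = left < chunk ? left : chunk;
        plan.push_back({off, n, off + n < payload_len});
    }
    return plan;
}
```

With a 4000-byte payload, a 20-byte header and an MTU of 1500, the plan contains three fragments of 1480, 1480 and 1040 payload bytes.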
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9. IP Datagram class 
" "The class " IP4Datagram~ represents a" collection of IP 
fragments, which may (or may not) represent a complete IP4 
datagram. Note that objects of the class IP4Datagram 
include a doubly-linked list of IP4Fragment objects in sorted 
order (sorted by IP offset). When IP fragments are inserted 
into a datagram (in order to perform reassembly), coalescing 
of data between fragments is not performed automatically. 
Thus, although the IP4Datagram object may easily deter- 
mine whether it contains a complete set of fragments, it does 
not automatically reconstruct a contiguous buffer of the 
original datagram's contents for the caller. 

This class supports the fragmentation, reassembly, and 
grouping of IP fragments. The IP4Datagram class is defined 
as follows: 

Constructors 
Description 
The class has two constructors. 

The first form of the constructor is used when creating a 
fresh datagram (typically for starting the process of 
reassembly). 

The second form is useful when an existing list of 
fragments is to be placed into the datagram immediately 
at its creation. 

Syntax 

IP4Datagram( ); 

IP4Datagram(IP4Fragment* frag); 
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fragment 
Description 

The fragment function breaks an IP datagram into a series 
of IP fragments, each of which will fit in the packet size 
specified by mtu. Its behavior is equivalent to the 
IP4Fragment::fragment(int mtu) function described previously. 

Syntax 

IP4Datagram* fragment (int mtu); 

Parameters 

See IP4Fragment::fragment (int mtu) above. 

Return Value 
See IP4Fragment::fragment (int mtu) above. 

Constructor Parameters 

Parameter   Type            Description 
frag        IP4Fragment *   Pointer to a doubly linked list of fragments 
                            used to create the datagram object 

Return Value 

None. 

Destructor 
Description 

The destructor calls the destructors for each of the fragments 
comprising the datagram and frees the datagram 
object. 

len 
Description 

The len function returns the entire length (in bytes) of the 
datagram, including all of its comprising fragments. Its 
value is only meaningful if the datagram is complete. 

Syntax 

int len( ); 

Parameters 

None. 

Return Value 

Returns the length of the entire datagram (in bytes). If the 
datagram contains multiple fragments, only the size of the 
first fragment header is included in this value. 

insert 
Description 

The function insert inserts a fragment into the datagram. 
The function attempts to reassemble the overall datagram by 
checking the IP offset and ID fields. 

Syntax 

int insert(IP4Fragment* frag); 

Parameters 

Parameter   Type            Description 
frag        IP4Fragment *   Pointer to the fragment being inserted. 

Return Value 

Because this function can fail/act in a large number of 
ways, the following definitions are provided to indicate the 
results of insertions that were attempted by the caller. The 
return value is a 32-bit word where each bit indicates a 
different error or unusual condition. The first definition 
below, IPD_INSERT_ERROR, is set whenever any of the 
other conditions are encountered. This is an extensible list 
which may evolve to indicate new error conditions in future 
releases: 

Define                Description 
IPD_INSERT_ERROR      'Or' of all other error bits. 
IPD_INSERT_OH         Head overlapped. 
IPD_INSERT_OT         Tail overlapped. 
IPD_INSERT_MISMATCH   Payload mismatch. 
IPD_INSERT_CKFAIL     IP header checksum failed (if enabled) 
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nfrags 
Description 


The function nfrags returns the number of fragments 
currently present in the datagram. 
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Syntax 



int nfrags( ); 

complete 
Description 

The function complete returns true when all fragments 
comprising the original datagram are present. 

Syntax 

bool complete( ); 

Parameters 

None. 

Return Value 

Returns a boolean value indicating when all fragments 
comprising the original datagram are present. 

head 
Description 

The function head returns the address of the first IP 
fragment in the datagram's linked list of fragments. 

Syntax 

IP4Fragment* head( ); 

Parameters 

None. 

Return Value 

Returns the address of the first IP fragment in the 
datagram's linked list of fragments. 

10. UDP Support 

The UDP protocol provides a best-effort datagram service. 
Due to its limited complexity, only the simple UDP 
header definitions are included here. Additional functions 
operating on several protocols (e.g. UDP and TCP NAT) are 
defined in subsequent sections. 

11. UDP Header 

The UDPHeader class defines the standard UDP header. It 
is defined in NBudp.h. In addition to the standard UDP 
header, the class includes a single method for convenience 
in accessing the payload portion of the UDP datagram. The 
class contains no virtual functions, and therefore pointers to 
the UDPHeader class may be used to point to UDP headers 
received in live network packets. 

The class contains a number of member functions, most 
of which provide direct access to the header fields. A special 
payload function may be used to obtain a pointer immediately 
beyond the UDP header. The following table lists the 
functions providing direct access to the header fields: 

Function   Return Type   Description 

sport( )   nuint16&      Returns a reference to the source UDP port 
                         number 
dport( )   nuint16&      Returns a reference to the destination UDP port 
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-continued 



Function   Return Type   Description 

                         number 
len( )     nuint16&      Returns a reference to the UDP length field 
cksum( )   nuint16&      Returns a reference to the UDP pseudoheader 
                         checksum. UDP checksums are optional; a 
                         value of all zero bits indicates no checksum 
                         was computed. 



The following function provides convenient access to the 
payload portion of the datagram, and maintains consistency 
with other protocol headers (i.e. IP and TCP). 
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payload 
Description 

The function payload returns the address of the first byte 
of data (beyond the UDP header). 

Syntax 

unsigned char* payload( ); 

Parameters 

None. 

Return Value 

Returns the address of the first byte of payload data in the 
UDP packet. 

12. TCP Support 

The TCP protocol provides a stateful connection-oriented 
stream service. The ASL provides the TCP-specific 
definitions, including the TCP header, plus a facility to 
monitor the content and progress of an active TCP flow as 
a third party (i.e. without having to be an endpoint). For 
address and port number translation of TCP, see the section 
on NAT in subsequent sections of this document. 

13. TCP Sequence Numbers 

TCP uses sequence numbers to keep track of an active 
data transfer. Each unit of data transfer is called a segment, 
and each segment contains a range of sequence numbers. In 
TCP, sequence numbers are in byte units. If a TCP connection 
is open and data transfer is progressing from computer 
A to B, TCP segments will be flowing from A to B and 
acknowledgements will be flowing from B toward A. The 
acknowledgements indicate to the sender the amount of data 
the receiver has received. TCP is a bi-directional protocol, so 
that data may be flowing simultaneously from A to B and 
from B to A. In such cases, each segment (in both directions) 
contains data for one direction of the connection and 
acknowledgements for the other direction of the connection. 

Both sequence numbers (sending direction) and 
acknowledgement numbers (reverse direction) use TCP sequence 
numbers as the data type in the TCP header. TCP sequence 
numbers are 32-bit unsigned numbers that are allowed to 
wrap beyond 2^32-1. Within the ASL, a special class called 
TCPSeq defines this type and associated operators, so that 
objects of this type may be treated like ordinary scalar types 
(e.g. unsigned integers). 
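
Because sequence numbers wrap, ordinary unsigned comparison gives the wrong answer near the wrap point. The sketch below shows the customary way to compare them, by looking at the sign of the 32-bit difference. The function names are hypothetical and the actual TCPSeq operators are not shown in this text. The conversion to a signed 32-bit value assumes two's complement behavior, which is guaranteed from C++20 and is what common compilers do for earlier standards.

```cpp
#include <cstdint>

// Hypothetical helpers: "a precedes b" in wrap-around sequence space,
// valid while the two values are less than 2^31 apart.
inline bool seq_lt(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) < 0;
}
inline bool seq_leq(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) <= 0;
}

// Advancing a sequence number simply wraps modulo 2^32.
inline std::uint32_t seq_add(std::uint32_t a, std::uint32_t n) {
    return a + n;
}
```

Under this ordering 0xFFFFFFF0 precedes 0x00000010, even though it is the larger unsigned value.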

14. TCP Header 

The TCPHeader class defines the standard TCP header. In 
addition to the standard TCP header, the class includes a set 
of methods for convenience in accessing the payload portion 
of the TCP stream. The class contains no virtual functions, 
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and therefore pointers to the TCPHeader class may be used 
to point to TCP headers received in live network packets. 

The class contains a number of member functions, most 
of which provide direct access to the header fields. A special 
payload function may be used to obtain a pointer immedi- 
ately beyond the TCP header. The following table lists the 
functions providing direct access to the header fields: 
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-continued - 



Parameters 



Parameter Type Description 



positions to scale window field) 



Function 


Return Type 


Description 


sport( ) 


nuintl6& 


Returns a reference to the source TCP port 




number 


dportQ 


nuintl6& 


Returns a reference to the destination TCP port 






number 


seq() 


TCPSeq& 


Returns a reference to the TCP sequence 






number 


ack() 


TCPSeq& 


Returns a reference to the TCP acknowledge- 






ment number 


off() 


nuintS 


Returns the number of 32-bit words in the TCP 




header (includes TCP options) 


fiags() 


miint8& 


Returns a reference to tie byte containing the 




6 flags bits (and % reserved bits) 


win( ) 


nuintl6& 


Returns a reference to the window advertise- 




ment field (unsealed) 


cksum( ) 


nuintl6& 


Returns a reference to the TCP pseudoheader 






checksum. TCP checksums are not optional. 


udp() 


nuintl6& 


Returns a reference to the TCP urgent pointer 




field 



10 



The following functions provides convenient access to other 
characteristics of the segment: 

payload 



Description 

The function payload returns the address of the first byte 35 
of data (beyond the TCP header). 

Syntax 

unsigned char* payload( ); ^ 
Parameters 

None. 

Return Value 45 

Returns the address of the first byte of payload data in the 
TCP packet. 

window

Description

The function window returns the window advertisement
contained in the segment, taking into account the use of TCP
large windows (see RFC 1323).

Syntax

uint32 window(int wshift)

Parameters

Parameter  Type  Description

wshift     int   The "window shift value" (number of left-shift bit
                 positions to scale window field)



Return Value 

Returns the receiver's advertised window in the segment 
(in bytes). This function is to be used when RFC1323-style 
window scaling is in use. 
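
The scaling that window performs can be sketched as a left shift of the raw 16-bit advertisement. This is a minimal illustration, not the class's actual implementation: the function name scaled_window is hypothetical, the raw value is assumed to be in host byte order already, and the clamp to 14 reflects the shift limit given in RFC 1323.

```cpp
#include <cstdint>

// Hypothetical helper: applies an RFC 1323 window shift to the raw
// 16-bit window field (assumed to be in host byte order already).
std::uint32_t scaled_window(std::uint16_t raw_window, int wshift) {
    // RFC 1323 limits the shift count to 14; clamp defensively.
    if (wshift < 0) wshift = 0;
    if (wshift > 14) wshift = 14;
    return static_cast<std::uint32_t>(raw_window) << wshift;
}
```

With a shift of 0 the result equals the unscaled field returned by win(), which is why the scaled form only matters on connections that negotiated window scaling.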

optbase 
Description 

The function optbase returns the address of the first option 
in the TCP header, if any are present. If no options are 
present, it returns the address of the first payload byte (which 
may be urgent data if the URG bit is set in the flags field). 

Syntax 

u_char* optbase( ) 

Parameters 

None. 

Return Value 

Returns the address of the first byte of data beyond the
urgent pointer field of the TCP header.

hlen 

Description 

The first form of this function returns the TCP header
length in bytes. The second form sets the TCP header
length to the number of bytes specified.

Syntax 

int hlen( ); 

void hlen(int bytes); 





Parameters

Parameter  Type  Description

bytes      int   Specifies the number of bytes present in the TCP
                 header



Return Value 

The first form returns the number of bytes in the TCP 
header. 
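
The relationship between off (32-bit words) and hlen (bytes) can be sketched as follows. The struct OffsetByte is a hypothetical stand-in for the header byte that carries the data offset in its upper four bits. Unlike the void hlen(int) described above, the setter here returns a bool so that the sketch can reject values that cannot be encoded; that is an assumption for illustration only.

```cpp
#include <cstdint>

// Hypothetical view of the TCP header byte holding the data offset:
// the offset lives in the upper four bits, counted in 32-bit words.
struct OffsetByte {
    std::uint8_t raw = 0x50;  // default: 5 words, i.e. no options

    int off() const { return raw >> 4; }
    int hlen() const { return off() * 4; }

    // Returns false and leaves the field untouched when 'bytes' cannot be
    // encoded (not a multiple of 4, or outside the 20..60 byte range).
    bool hlen(int bytes) {
        if (bytes % 4 != 0 || bytes < 20 || bytes > 60) return false;
        raw = static_cast<std::uint8_t>(((bytes / 4) << 4) | (raw & 0x0F));
        return true;
    }
};
```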

Definitions 

In addition to the TCP header itself, a number of defini- 
tions are provided for manipulating options in TCP headers: 
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TCP Options 



Define                  Value  Description

TCPOPT_EOL              0      End of Option List
TCPOPT_NOP              1      No operation (used for padding)
TCPOPT_MAXSEG           2      Maximum segment size
TCPOPT_SACK_PERMITTED   4      Selective Acknowledgements available
TCPOPT_SACK             5      Selective Acknowledgements in this segment
TCPOPT_TIMESTAMP        8      Timestamps
TCPOPT_CC               11     for T/TCP (see RFC 1644)
TCPOPT_CCNEW            12     for T/TCP
TCPOPT_CCECHO           13     for T/TCP
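
As an illustration of how these option definitions might be used, the sketch below walks a TCP option list looking for the maximum segment size. The function find_mss is hypothetical and operates on a plain byte buffer, such as the region starting at the address returned by optbase. It relies only on the standard kind/length/value option layout.

```cpp
#include <cstddef>
#include <cstdint>

// Option kinds, using the values from the table above.
enum : std::uint8_t { TCPOPT_EOL = 0, TCPOPT_NOP = 1, TCPOPT_MAXSEG = 2 };

// Scans the option bytes [opts, opts + len) for a maximum segment size
// option. Returns the MSS, or -1 if absent or the list is malformed.
int find_mss(const std::uint8_t* opts, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        std::uint8_t kind = opts[i];
        if (kind == TCPOPT_EOL) break;                  // end of option list
        if (kind == TCPOPT_NOP) { ++i; continue; }      // one-byte padding
        if (i + 1 >= len) return -1;                    // no room for a length byte
        std::uint8_t optlen = opts[i + 1];
        if (optlen < 2 || i + optlen > len) return -1;  // malformed length
        if (kind == TCPOPT_MAXSEG && optlen == 4)
            return (opts[i + 2] << 8) | opts[i + 3];    // value is big-endian
        i += optlen;
    }
    return -1;
}
```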



15. TCP Following 

TCP operates as an 11-state finite state machine. Most of 
the states are related to connection establishment and tear- 
down. By following certain control bits in the TCP headers 
of segments passed along a connection, it is possible to infer 
the TCP state at each endpoint, and to monitor the data 
exchanged between the two endpoints. 

Defines 

The following definitions are for TCP state monitoring, 
and indicate states in the TCP finite state machine: 
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The class contains the following fields, all of which are 
declared public:



Field      Type          Description

prev_      TCPSegInfo*   Pointer to the previous TCPSegInfo object of
                         the reverse linked list; NULL if no previous
                         segment exists
next_      TCPSegInfo*   Pointer to the next TCPSegInfo object of the
                         forward linked list; NULL if no more
                         segments exist
segment_   IP4Datagram*  Pointer to the datagram containing the TCP
                         segment
startseq_  TCPSeq        The starting sequence number for the segment
endseq_    TCPSeq        The ending sequence number for the segment
startbuf_  u_char*       Pointer to the byte whose sequence number
                         is specified by the startseq_ field
endbuf_    u_char*       Pointer to the byte whose sequence number
                         is specified by the endseq_ field
flags      uint32        Flags field for the segment (reserved as of
                         the EA2 release)



The ReassemblyQueue Class 

The ReassemblyQueue class is a container class used in
reconstructing TCP streams from TCP segments that have
been "snooped" on a TCP connection. This class contains a
list of TCPSegInfo objects, each of which corresponds to a
single TCP segment. The purpose of this class is not only to
contain the segments, but to reassemble received segments



Define                      Value                        Description

TCPS_CLOSED                 0                            Closed
TCPS_LISTEN                 1                            Listening for connection.
TCPS_SYN_SENT               2                            Active open, have sent SYN.
TCPS_SYN_RECEIVED           3                            Have sent and received SYN.
TCPS_ESTABLISHED            4                            Established.
TCPS_CLOSE_WAIT             5                            Received FIN, waiting for close.
TCPS_FIN_WAIT_1             6                            Have closed, sent FIN.
TCPS_CLOSING                7                            Closed, exchanged FIN; awaiting FIN ACK.
TCPS_LAST_ACK               8                            Had FIN and close; await FIN ACK.
TCPS_FIN_WAIT_2             9                            Have closed, FIN is acked.
TCPS_TIME_WAIT              10                           In 2*MSL quiet wait after close.
TCPS_HAVERCVDSYN(s)         ((s) >= TCPS_SYN_RECEIVED)   True if state s indicates a SYN has
                                                         been received
TCPS_HAVEESTABLISHED(s)     ((s) >= TCPS_ESTABLISHED)    True if state s indicates the connection
                                                         has ever been established
TCPS_HAVERCVDFIN(s)         ((s) >= TCPS_TIME_WAIT)      True if state s indicates a FIN has ever
                                                         been received

Note 1: States less than TCPS_ESTABLISHED indicate connections not yet established.

Note 2: States greater than TCPS_CLOSE_WAIT are those where the user has closed.

Note 3: States greater than TCPS_CLOSE_WAIT and less than TCPS_FIN_WAIT_2 await ACK of FIN.
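
Because the state values are ordered, the macros and notes above reduce to simple comparisons. The sketch below restates them as C++ inline functions. The names have_rcvd_syn, have_established, have_rcvd_fin, user_has_closed, and awaiting_fin_ack are hypothetical and chosen only for this illustration.

```cpp
// State values from the table above.
enum TcpState {
    TCPS_CLOSED = 0, TCPS_LISTEN = 1, TCPS_SYN_SENT = 2,
    TCPS_SYN_RECEIVED = 3, TCPS_ESTABLISHED = 4, TCPS_CLOSE_WAIT = 5,
    TCPS_FIN_WAIT_1 = 6, TCPS_CLOSING = 7, TCPS_LAST_ACK = 8,
    TCPS_FIN_WAIT_2 = 9, TCPS_TIME_WAIT = 10
};

// Function forms of the three macros in the table.
inline bool have_rcvd_syn(int s)    { return s >= TCPS_SYN_RECEIVED; }
inline bool have_established(int s) { return s >= TCPS_ESTABLISHED; }
inline bool have_rcvd_fin(int s)    { return s >= TCPS_TIME_WAIT; }

// Hypothetical helpers expressing Notes 2 and 3.
inline bool user_has_closed(int s)  { return s > TCPS_CLOSE_WAIT; }
inline bool awaiting_fin_ack(int s) {
    return s > TCPS_CLOSE_WAIT && s < TCPS_FIN_WAIT_2;
}
```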



The TCPSegInfo Class

The TCPSegInfo class is a container class for TCP segments
that have been queued during TCP stream reconstruction
and may be read by applications (using the
ReassemblyQueue::read function, defined below). When segments
are queued, they are maintained in a doubly-linked list
sorted by sequence number order. Note that the list may
contain "holes". That is, it may contain segments that are not
adjacent in the space of sequence numbers because some
data is missing in between. In addition, because retransmitted
TCP segments can potentially overlap one another's data
areas, the starting and ending sequence number fields
(startseq_ and endseq_) may not correspond to the starting
sequence number.



as they arrive and present them in proper sequence number 
order for applications to read. Applications are generally 
able to read data on the connection in order, or to skip some 
fixed amount of enqueued data.

Constructor 

Description 

A ReassemblyQueue object is used internally by the TCP
stream reconstruction facility, but may be useful to applications
in general under some circumstances. It provides
for reassembly of TCP streams based on sequence numbers
contained in TCP segments. The constructor takes an argument
specifying the next sequence number to expect. It is
updated as additional segments are inserted into the object.
If a segment is inserted which is not contiguous in sequence
number space, it is considered "out of order" and is queued
in the object until the "hole" (data between it and the
previous in-sequence data) is filled.
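
The hole-filling behavior described above can be sketched with a toy queue. This is not the ReassemblyQueue class itself: ToyReassemblyQueue and its members are hypothetical, segments are modeled as strings rather than IP datagrams, and the sketch assumes no sequence-number wraparound and no overlapping segments.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy reassembly queue: a sketch, not the ReassemblyQueue class itself.
// Assumes no sequence-number wraparound and no overlapping segments.
class ToyReassemblyQueue {
public:
    explicit ToyReassemblyQueue(std::uint32_t& rcvnxt) : rcvnxt_(rcvnxt) {}

    // Returns true if the segment was in order, false if it was queued
    // behind a hole (or ignored as old data).
    bool add(std::uint32_t seq, const std::string& data) {
        if (seq != rcvnxt_) {
            if (seq > rcvnxt_) pending_[seq] = data;  // out of order
            return false;
        }
        ready_ += data;
        rcvnxt_ += static_cast<std::uint32_t>(data.size());
        // The new data may have filled a hole; drain what is now contiguous.
        auto it = pending_.find(rcvnxt_);
        while (it != pending_.end()) {
            ready_ += it->second;
            rcvnxt_ += static_cast<std::uint32_t>(it->second.size());
            pending_.erase(it);
            it = pending_.find(rcvnxt_);
        }
        return true;
    }

    const std::string& contiguous() const { return ready_; }
    bool has_holes() const { return !pending_.empty(); }

private:
    std::uint32_t& rcvnxt_;  // caller-owned, updated as data arrives
    std::map<std::uint32_t, std::string> pending_;
    std::string ready_;
};
```

As in the constructor described here, the next expected sequence number is held by reference, so the caller sees it advance as in-order data arrives.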

Syntax 

ReassemblyQueue (TCPSeq& rcvnxt) 



int add(IP4Fragment* fp, TCPSeq seq, uint32 dlen); 



Parameter  Type          Description

fp         IP4Fragment*  Pointer to an unfragmented IP fragment
                         containing a TCP segment
dp         IP4Datagram*  A pointer to a complete IP datagram
                         containing a TCP segment
seq        TCPSeq        Initial sequence number for the TCP segment
dlen       uint32        Usable length of the TCP segment



Parameter  Type     Description

rcvnxt     TCPSeq&  A reference to the next TCP sequence number to
                    expect. The sequence number referred to by
                    rcvnxt is updated by the add function (see below)
                    to always indicate the next in-order TCP sequence
                    number expected
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None. 



Return Value 



Defines 



Return value 

Returns a 32-bit integer with the possible values indicated 
above (definitions beginning with RQ_). 

empty

Description

The empty function returns true if the reassembly queue
contains no segments.

Syntax

bool empty()

Parameters

None.

Return value

Returns true if the reassembly queue contains no segments.

The following definitions are provided for insertion of
TCP segments into a ReassemblyQueue object, and are used
as return values for the add function defined below.
Generally, acceptable conditions are indicated by bits in the
low-order half-word, and suspicious or error conditions are
indicated in the upper half-word.

Define            Value       Description

RQ_OK             0x00000000  Segment was non-overlapping and in-order
RQ_OUTORDER       0x00000001  Segment was out of order (didn't match next
                              expected sequence number)
RQ_LOW_OLAP       0x00000002  Segment's sequence number was below next
                              expected but segment extended past next
                              expected
RQ_HIGH_OLAP      0x00000004  Segment's data overlapped another queued
                              segment's data
RQ_DUP            0x00000008  Completely duplicate segment
RQ_BAD_HLEN       0x00010000  Bad header length (e.g. less than 5)
RQ_BAD_RSVD       0x00020000  Bad reserved field (reserved bits are
                              non-zero)
RQ_FLAGS_ALERT    0x00040000  Suspicious combination of flags (e.g. RST
                              on or all on, etc.)
RQ_FLAGS_BADURP   0x00080000  Bad urgent pointer
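
The split between the low-order and upper half-words of the RQ_ return codes makes it cheap to separate acceptable results from suspicious ones. The sketch below assumes the codes may be combined as bit flags, as the description of the definitions suggests. The helper names rq_suspicious and rq_in_order are hypothetical.

```cpp
#include <cstdint>

// Return codes for add, using the values from the RQ_ definitions.
constexpr std::uint32_t RQ_OK = 0x00000000, RQ_OUTORDER = 0x00000001,
                        RQ_LOW_OLAP = 0x00000002, RQ_HIGH_OLAP = 0x00000004,
                        RQ_DUP = 0x00000008, RQ_BAD_HLEN = 0x00010000,
                        RQ_BAD_RSVD = 0x00020000, RQ_FLAGS_ALERT = 0x00040000,
                        RQ_FLAGS_BADURP = 0x00080000;

// Hypothetical helper: true if any bit in the upper half-word is set.
inline bool rq_suspicious(std::uint32_t rc) { return (rc & 0xFFFF0000u) != 0; }

// Hypothetical helper: true only for a clean, in-order insertion.
inline bool rq_in_order(std::uint32_t rc) { return rc == RQ_OK; }
```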

clear

Description

The clear function removes all queued segments from the
reassembly queue and frees their storage.

Syntax

void clear()

Parameters

None.

Return value

None.



add 



Description 

The add function inserts an IP datagram or complete IP 
fragment containing a TCP segment into the reassembly 
queue. The TCP sequence number referenced by rcvnxt in 
the constructor is updated to reflect the next in-sequence 
sequence number expected. 

Syntax 

int add(IP4Datagram* dp, TCPSeq seq, uint32 dlen); 



Syntax 

Parameters 

Return value 

read 
Description 



The read function provides application access to the
contiguous data currently queued in the reassembly queue.
The function returns a linked list of TCPSegInfo objects.
The list is sorted by sequence number, beginning
with the first in-order sequence number, and continues no
further than the number of bytes specified by the caller. Note
that the caller must inspect the value filled in by the call to
determine how many bytes' worth of sequence number space
is consumed by the linked list. This call removes the
segments returned to the caller from the reassembly queue.
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Syntax 

TCPSegInfo* read(int& len);



Parameters 



Parameter  Type  Description

len        int&  Contains the number of bytes' worth of in-sequence
                 data the application is interested in reading from the
                 reassembly queue. The underlying integer is
                 modified by this call to indicate the number of
                 bytes actually covered by the list of segments
                 returned. The call is guaranteed to never return a
                 larger number of bytes than requested.



Return value 

Returns a pointer to the first TCPSegInfo object in a
doubly-linked list of objects each of which points to TCP 
segments that are numerically adjacent in TCP sequence 
number space. 

The TCPEndpoint Class 

The TCPEndpoint class is the abstraction of a single
endpoint of a TCP connection. In TCP, a connection is
identified by a 4-tuple of two IP addresses and two port
numbers. Each endpoint is identified by a single IP address
and port number. Thus, a TCP connection (or "session", see
below) actually comprises two endpoint objects. Each endpoint
contains the TCP finite state machine state as well as
a ReassemblyQueue object, used to contain queued data.
The TCPEndpoint class is used internally by the TCPSession
class below, but may be useful to applications in certain
circumstances.

Constructor 
Description 

The TCPEndpoint object is created in an empty state and
is unable to determine which endpoint of a connection it
represents. The user should call the init function described
below after object instantiation to begin use of the object.



Syntax

TCPEndpoint()

Parameters

None.

Return value

None.

Destructor

Description

Deletes all queued TCP segments and frees the object's
memory.

Syntax

~TCPEndpoint()

Parameters

None.

Return value

None.



reset

Description

Resets the endpoint internal state to closed and clears any
queued data.

Syntax

void reset()

Parameters

None.

Return value

None.



state

Description

Returns the current state in the TCP finite state machine
associated with the TCP endpoint.

Syntax

int state()

Parameters

None.

Return value

Returns an integer indicating the internal state according
to the definitions given above (defines beginning with
TCPS_).

init 
Description 

The init function provides initialization of a TCP endpoint 
object by specifying the IP address and port number the 
endpoint is acting as. After this call has been made, subse- 
quent processing of IP datagrams and fragments containing 
TCP segments (and ACKs) is accomplished by the process 
calls described below. 

Syntax 

void init(IP4Addr* myaddr, uint16 myport);







Parameters

Parameter  Type      Description

myaddr     IP4Addr*  A pointer to the IP address identifying this TCP
                     endpoint
myport     nuint16   The port number (in network byte order)
                     identifying this TCP endpoint



Return value

None.

process

Description

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The process function processes an incoming or outgoing 
TCP segment relative to the TCP endpoint object. The first 
form of the function operates on a datagram which must be
complete; the second form operates on a fragment which 
must also be complete. Given that the TCPEndpoint object 
is not actually the literal endpoint of the TCP connection 
itself, it must infer state transitions at the literal endpoints 
based upon observed traffic. Thus, it must monitor both 
directions of the TCP connection to properly follow the state 
at each literal endpoint. 
Syntax 

int process(IP4Datagram* pd); 
int process(IP4Fragment* pf); 
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-continued 







Parameters

Parameter  Type          Description

                         ing the first TCP segment on the connection
fp         IP4Fragment*  Pointer to a complete IP fragment containing
                         the first TCP segment on the connection






Return value 


None. 




Destructor 







Parameters

Parameter  Type          Description

pd         IP4Datagram*  A pointer to a complete IP datagram containing
                         a TCP segment
pf         IP4Fragment*  Pointer to an unfragmented IP fragment
                         containing a TCP segment



Return value 

Returns a 32-bit integer with the same semantics defined 
for ReassemblyQueue::add (see above). 
The TCPSession Class 

The TCPSession class is the abstraction of a complete, 
bi-directional TCP connection. It includes two TCP endpoint 
objects, which each include a reassembly queue. Thus, 
provided the TCPSession object is able to process all data 
sent on the connection in either direction it will have a 
reasonably complete picture of the progress and data 
exchanged across the connection. 

Constructor 

Description 

The TCPSession object is created by the caller when a 
TCP segment arrives on a new connection. The session 
object will infer from the contents of the segment which 
endpoint will be considered the client (the active opener — 
generally the sender of the first SYN), and which will be 
considered the server (the passive opener — generally the 
sender of the first SYN+ACK). In circumstances of simul- 
taneous active opens (a rare case when both endpoints send 
SYN packets), the notion of client and server is not well 
defined, but the session object will behave as though the 
sender of the first SYN received by the session object is the 
client. In any case, the terms client and server are only 
loosely defined and do not affect the proper operation of the 
object. 
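
The client/server inference described above can be sketched from the TCP flag bits of the first segment observed. The helper sender_role and the Role type are hypothetical and not part of the TCPSession interface. The flag masks are the standard SYN (0x02) and ACK (0x10) bits, and treating anything other than a SYN+ACK as coming from the client is an assumption made for this illustration.

```cpp
#include <cstdint>

// TCP flag bits as laid out in the flags byte of the TCP header.
constexpr std::uint8_t TH_SYN = 0x02;
constexpr std::uint8_t TH_ACK = 0x10;

enum class Role { Client, Server };

// Guesses the role of the SENDER of the first segment seen on a
// connection. Hypothetical helper for illustration only.
Role sender_role(std::uint8_t flags) {
    // A SYN+ACK comes from the passive opener; a bare SYN (or anything
    // else, e.g. when observation starts mid-stream) is treated as client.
    bool syn_ack = (flags & (TH_SYN | TH_ACK)) == (TH_SYN | TH_ACK);
    return syn_ack ? Role::Server : Role::Client;
}
```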

Syntax 

TCPSession (IP4Datagram* dp); 
TCPSession (IP4Fragment* fp); 



Parameters 
Parameter Type Description 

pd IP4Datagram" A pointer to a complete IP datagram contain- 



Description 

Deletes all queued TCP segments and frees the object's
memory.

Syntax

~TCPSession()

Parameters

None.

Return value

None.

process 
Description 

The process function processes a TCP segment on the
connection. The first form of the function operates on a
datagram which must be complete; the second form operates
on a fragment which must also be complete. This function
operates by passing the datagram or fragment to each
endpoint's process function.

Syntax

int process (IP4Datagram* dp);
int process (IP4Fragment* fp);







Parameters

Parameter  Type          Description

dp         IP4Datagram*  A pointer to a complete IP datagram containing
                         a TCP segment
fp         IP4Fragment*  Pointer to an unfragmented IP fragment
                         containing a TCP segment

Return value 

Returns a 32-bit integer with the same semantics defined
for ReassemblyQueue::add (see above). The value returned
will be the result of calling the add function of the reassembly
queue object embedded in the endpoint object corresponding
to the destination address and port of the received
segment.

16. Network Address Translation (NAT) 
Network Address Translation (NAT) refers to the general
ability to modify various fields of different protocols so that
the effective source, destination, or source and destination
entities are replaced by an alternative. The definitions needed
to perform NAT for the IP, UDP, and TCP protocols are provided
within the ASL. The NAT implementation uses incremental
checksum computation, so performance should not degrade
in proportion to packet size.

17. IP NAT 

IP address translation refers to the mapping of an IP
datagram (fragment) with source and destination IP address
pair (s1,d1) to the same datagram (fragment) with new address
pair (s2,d2). A source rewrite modifies only the source
address (d1 is left equal to d2). A destination rewrite implies
only the destination address is rewritten (s1 is left equal to
s2). A source and destination rewrite refers to a change in
both the source and destination IP addresses. Note that for IP
NAT, only the IP source and/or destination addresses are
rewritten (in addition to rewriting the IP header checksum).
For traffic such as TCP or UDP, NAT functionality must
include modification of the TCP or UDP pseudoheader
checksum (which covers the IP header source and destination
addresses plus protocol field). Properly performing
NAT on TCP or UDP traffic requires attention to these
details.
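
The text notes that the NAT implementation computes checksums incrementally but does not give the method. One common way to do this is the update formula from RFC 1624, sketched below under that assumption. The function names are hypothetical. The same update applies to the IP header checksum and to a TCP or UDP pseudoheader checksum, since both cover the rewritten address.

```cpp
#include <cstdint>

// Folds a 32-bit sum into 16 bits with end-around carry.
static std::uint16_t fold(std::uint32_t sum) {
    while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);
    return static_cast<std::uint16_t>(sum);
}

// RFC 1624 (eqn. 3): HC' = ~(~HC + ~m + m') for one 16-bit word.
std::uint16_t cksum_update16(std::uint16_t hc, std::uint16_t old_w,
                             std::uint16_t new_w) {
    std::uint32_t sum = static_cast<std::uint16_t>(~hc);
    sum += static_cast<std::uint16_t>(~old_w);
    sum += new_w;
    return static_cast<std::uint16_t>(~fold(sum));
}

// Applies the update for a 32-bit address by treating it as two words.
std::uint16_t cksum_update32(std::uint16_t hc, std::uint32_t old_a,
                             std::uint32_t new_a) {
    hc = cksum_update16(hc, static_cast<std::uint16_t>(old_a >> 16),
                        static_cast<std::uint16_t>(new_a >> 16));
    return cksum_update16(hc, static_cast<std::uint16_t>(old_a & 0xFFFF),
                          static_cast<std::uint16_t>(new_a & 0xFFFF));
}

// Full recomputation over 16-bit words, used only to check the sketch.
std::uint16_t cksum_full(const std::uint16_t* w, int n) {
    std::uint32_t sum = 0;
    for (int i = 0; i < n; ++i) sum += w[i];
    return static_cast<std::uint16_t>(~fold(sum));
}
```

The cost of the update depends only on the number of words changed, not on the packet length, which is consistent with the claim that performance should not degrade in proportion to packet size.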

18. IP NAT Base Class 

The class IP4Nat provides a base class for other IP NAT
classes. Because of the pure virtual function rewrite, applications
will not create objects of type IP4Nat directly, but
rather use the objects of type IP4SNat, IP4DNat, and
IP4SDNat defined below.

rewrite 
Description 

This pure-virtual function is defined in derived classes. It
performs address rewriting in a specific fashion implemented
by the specific derived classes (i.e. source,
destination, or source/destination combination). The rewrite
call, as applied to a fragment, only affects the given fragment.
When applied to a datagram, each of the fragment
headers comprising the datagram is rewritten.

Syntax 

virtual void rewrite (IP4Datagram* dp) = 0;
virtual void rewrite (IP4Fragment* fp) = 0;







Parameters

Parameter  Type          Description

dp         IP4Datagram*  Pointer to the datagram to rewrite
fp         IP4Fragment*  Pointer to the single fragment to rewrite



Return value 

None. 

There are three classes available for implementing IP 
NAT, all of which are derived from the base class IP4Nat. 
The classes IP4SNat, IP4DNat, and IP4SDNat define the
structure of objects implementing source, destination, and 
source/destination rewriting for IP datagrams and fragments. 
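
The shape of this hierarchy, a base class with a pure virtual rewrite and three concrete rewriters, can be sketched with toy types. Everything below (ToyIpHeader, ToyNat, ToySNat, ToyDNat, ToySDNat) is a hypothetical stand-in rather than the ASL classes, and the checksum updates that real NAT requires are deliberately left out.

```cpp
#include <cstdint>

// Toy stand-in for an IP header; only the two address fields matter here.
struct ToyIpHeader {
    std::uint32_t src;
    std::uint32_t dst;
};

// Base class with a pure virtual rewrite, mirroring the shape of IP4Nat.
class ToyNat {
public:
    virtual ~ToyNat() = default;
    virtual void rewrite(ToyIpHeader* h) = 0;
};

class ToySNat : public ToyNat {
public:
    explicit ToySNat(std::uint32_t newsrc) : newsrc_(newsrc) {}
    void rewrite(ToyIpHeader* h) override { h->src = newsrc_; }
private:
    std::uint32_t newsrc_;
};

class ToyDNat : public ToyNat {
public:
    explicit ToyDNat(std::uint32_t newdst) : newdst_(newdst) {}
    void rewrite(ToyIpHeader* h) override { h->dst = newdst_; }
private:
    std::uint32_t newdst_;
};

// Source and destination rewriting composed from the two above.
class ToySDNat : public ToyNat {
public:
    ToySDNat(std::uint32_t newsrc, std::uint32_t newdst)
        : s_(newsrc), d_(newdst) {}
    void rewrite(ToyIpHeader* h) override { s_.rewrite(h); d_.rewrite(h); }
private:
    ToySNat s_;
    ToyDNat d_;
};
```

Callers that hold a pointer to the base class can apply whichever rewrite policy was configured without knowing which derived class is behind it.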

19. IP4SNat class 

The IP4SNat class is derived from the IP4Nat class. It 
defines the class of objects implementing source rewriting 
for IP datagrams and fragments. 
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Constructor 

Description 
Instantiates the IP4SNat object. 

Syntax 

IP4SNat (IP4Addr* newsrc); 



Parameters

Parameter  Type      Description

newsrc     IP4Addr*  Pointer to the new source address for IP NAT.

Return value

None.




rewrite 



Description 

Defines the pure virtual rewrite functions in the parent
class.

Syntax

void rewrite (IP4Datagram* dp);
void rewrite (IP4Fragment* fp);







Parameters

Parameter  Type          Description

dp         IP4Datagram*  Pointer to the datagram to be rewritten (all
                         fragment headers are modified)
fp         IP4Fragment*  Pointer to the fragment to rewrite (only the
                         single fragment header is modified)

Return value

None.

20. IP4DNat class

The IP4DNat class is derived from the IP4Nat class. It
defines the class of objects implementing destination rewriting
for IP datagrams and fragments.

Constructor

Description

Instantiates the IP4DNat object.

Syntax

IP4DNat (IP4Addr* newdst);



Parameters

Parameter  Type      Description

newdst     IP4Addr*  Pointer to the new destination address for IP
                     NAT.




Return value 


None. 


rewrite 
Description 


Defines the pure virtual rewrite functions in the parent 
class. 




Syntax 


void rewrite (IP4Datagram* dp);
void rewrite (IP4Fragment* fp); 




Parameters

Parameter  Type          Description

dp         IP4Datagram*  Pointer to the datagram to be rewritten (all
                         fragment headers are modified)
fp         IP4Fragment*  Pointer to the fragment to rewrite (only the
                         single fragment header is modified)



Return value 

None. 
21. IP4SDNat class 

The IP4SDNat class is derived from the IP4Nat class. It
defines the class of objects implementing source and desti- 
nation rewriting for IP datagrams and fragments. 

Constructor 

Description 

Instantiates the IP4SDNat object. 

Syntax 

IP4SDNat (IP4Addr* newsrc, IP4Addr* newdst); 



Parameters

Parameter  Type      Description

newsrc     IP4Addr*  Pointer to the new source address for IP NAT.
newdst     IP4Addr*  Pointer to the new destination address for IP
                     NAT.

Return value

None.

rewrite 
Description 



Defines the pure virtual rewrite functions in the parent 
class. 



Syntax

void rewrite (IP4Datagram* dp);
void rewrite (IP4Fragment* fp);

Parameters

Parameter  Type          Description

dp         IP4Datagram*  Pointer to the datagram to be rewritten (all
                         fragment headers are modified)
fp         IP4Fragment*  Pointer to the fragment to rewrite (only the
                         single fragment header is modified)

Return value

None.



Example 



For fragments, only the single fragment is modified. For 
datagrams, all comprising fragments are updated. The fol- 
lowing simple example illustrates the use of one of these 
objects: 

Assume ipal is an address we wish to place in the IP
packet's destination address field, buf points to the ASL
buffer containing an IP packet we wish to rewrite, and iph
points to the IP header of the packet contained in the buffer.

IP4DNat* ipd = new IP4DNat(&ipal); // create IP DNat object
IP4Fragment ipf(buf, iph);         // create IP fragment object
ipd->rewrite(&ipf);                // rewrite fragment's header

The use of other IP NAT objects follows a similar pattern.

22. UDP NAT 

The organization of the UDP NAT classes follows the IP
NAT classes very closely. The primary difference is in the
handling of UDP ports. For UDP NAT, the optional rewriting 
of port numbers (in addition to IP layer addresses) is 
specified in the constructor. 

23. UDPNat base class 

The class UDPNat provides a base class for other UDP 
NAT classes. The constructor is given a value indicating 
whether port number rewriting is enabled. Because of the 
pure virtual function rewrite, applications will not create 
objects of type UDPNat directly, but rather use the objects 
of type UDPSNat, UDPDNat, and UDPSDNat defined 
below. 

Constructor 



Description 

The constructor is given a value indicating whether port
number rewriting is enabled.



Syntax 



UDPNat (bool doports); 



60 



Parameters

Parameter  Type  Description

doports    bool  Boolean value indicating whether the port number
                 rewriting is enabled. A true value indicates port
                 number rewriting is enabled.
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24. UDPSNat class 

The UDPSNat class is derived from the UDPNat class. It
defines the class of objects implementing source address and
(optionally) port number rewriting for complete and fragmented
UDP datagrams.

Constructors 



Return value

None.

rewrite

Description

This pure-virtual function is defined in derived classes. It
performs address rewriting in a specific fashion implemented
by the specific derived classes (i.e. source,
destination, or source/destination combination). The rewrite
call, as applied to a fragment, only affects the given fragment.
When applied to a datagram, each of the fragment
headers comprising the datagram is rewritten.

Syntax

virtual void rewrite(IP4Datagram* dp) = 0;
virtual void rewrite(IP4Fragment* fp) = 0;



Parameters

Parameter  Type          Description

dp         IP4Datagram*  Pointer to the datagram to rewrite
fp         IP4Fragment*  Pointer to the single fragment to rewrite



Description 

The single-argument constructor is used to create UDP
NAT objects that rewrite only the addresses in the IP header
(and update the IP header checksum and UDP pseudoheader
checksum appropriately). The two-argument constructor
is used to create NAT objects that also rewrite the
source port number in the UDP header. For fragmented UDP
datagrams, the port numbers will generally be present in
only the first fragment.

Syntax 

UDPSNat (IP4Addr* newsaddr, nuint16 newsport);
UDPSNat (IP4Addr* newsaddr);







Parameters

Parameter  Type      Description

newsaddr   IP4Addr*  Pointer to the new source address to be used
newsport   nuint16   The new source port number to be used



Return value 

None. 

rewrite 



Return value

None.

ports

Description

The first form of this function returns true if the NAT
object is configured to rewrite port numbers. The second
form of this function configures the object to enable or
disable port number rewriting using the values true and
false, respectively.

Syntax

bool ports();
void ports(bool p);

Parameters

Parameter  Type  Description

p          bool  Boolean indicating whether port rewriting is
                 enabled.
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Return value 

The first form of this function returns true if the NAT
object is configured to rewrite UDP port numbers. 



Description 

Defines the pure virtual rewrite functions in the parent 
class. 

Syntax 

void rewrite(IP4Datagram* dp); 
void rewrite(IP4Fragment* fp); 



Parameters 
Parameter Type Description 

dp IP4Datagram * Pointer to the datagram to be rewritten (all 

fragment headers are modified) 

fp IP4Fragment * Pointer to the fragment to rewrite (only the 

single fragment header is modified). Should 
only be called when the fragment represents 
a complete UDP/IP datagram. 



Return value 

None. 

25. UDPDNat class 

The UDPDNat class is derived from the UDPNat class. It 
defines the class of objects implementing destination address 
and (optionally) port number rewriting for complete and 
fragmented UDP datagrams. 
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Constructors 
Description 

The single-argument constructor is used to create UDP 
NAT objects that rewrite only the addresses in the IP header 
(and update the IP header checksum and UDP pseudo- 
header checksum appropriately). The two-argument con- 
structor is used to create NAT objects that also rewrite the 
destination port number in the UDP header. For fragmented 
UDP datagrams, the port numbers will generally be present 
in only the first fragment. 

Syntax 

UDPDNat (IP4Addr* newdaddr, nuint16 newdport);
UDPDNat (IP4Addr* newdaddr);



Parameters 
Parameter Type Description 

newdaddr IP4Addr * Pointer to the new destination address to be used
newdport nuint16 The new destination port number to be used



Return value 

None. 

rewrite 
Description 

Defines the pure virtual rewrite functions in the parent 
class. 

Syntax 

void rewrite(IP4Datagram* dp); 
void rewrite(IP4Fragment* fp); 



Parameters 
Parameter Type Description 

dp IP4Datagram * Pointer to the datagram to be rewritten (all 

fragment headers are modified) 

fp IP4Fragment * Pointer to the fragment to rewrite (only the 

single fragment header is modified). Should 
only be called when the fragment represents 
a complete UDP/IP datagram. 



Return value 

None. 

26. UDPSDNat class 

The UDPSDNat class is derived from the UDPNat class. 
It defines the class of objects implementing source and 
destination address and (optionally) port number rewriting 
for complete and fragmented UDP datagrams. 

Constructors 

Description 

The two-argument constructor is used to create UDP NAT
objects that rewrite only the addresses in the IP header (and
update the IP header checksum and UDP pseudo-header
checksum appropriately). The four-argument constructor is
used to create NAT objects that also rewrite the source and
destination port numbers in the UDP header. For fragmented
UDP datagrams, the port numbers will generally be present in
only the first fragment.

Syntax 

UDPSDNat (IP4Addr* newsaddr, nuint16 newsport,
IP4Addr* newdaddr, nuint16 newdport);

UDPSDNat (IP4Addr* newsaddr, IP4Addr* newdaddr);



Parameters

Parameter  Type      Description

newsaddr   IP4Addr*  Pointer to the new source address to be used
newsport   nuint16   The new source port number to be used
newdaddr   IP4Addr*  Pointer to the new destination address to be used
newdport   nuint16   The new destination port number to be used

Return value

None.

rewrite






Description 



Defines the pure virtual rewrite functions in the parent 
class. 



Syntax 

void rewrite(IP4Datagram* dp);
void rewrite(IP4Fragment* fp); 



Parameters 
Parameter Type Description 

dp IP4Datagram * Pointer to the datagram to be rewritten (all 

fragment headers are modified) 
fp IP4Fragment * Pointer to the fragment to rewrite (only the

single fragment header is modified). Should 
only be called when the fragment represents 
a complete UDP/IP datagram. 



50 

Return value 

None. 

27. TCP NAT 

The structure of the TCP NAT support classes follows the
UDP classes very closely. The primary difference is in the
handling of TCP sequence and ACK numbers.

28. TCPNat base class 

The class TCPNat provides a base class for other TCP 
eo NAT classes. The constructor is given a pair of values 
indicating whether port number, sequence number, and 
acknowledgment number rewriting is enabled. Sequence 
number and ACK number rewriting are coupled such that 
enabling sequence number rewriting for source -rewriting 
65 will modify the sequence number field of the TCP segment, 
but enabling sequence number rewriting for destination- 
rewriting will instead modify the ACK field. This arrange- 
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ment makes it possible to perform NAT on TCP streams 
" through unnecessary complexity in the TCP NAT interface. 
Because of the pure virtual function rewrite, applications 
will not create objects of type TCPNat directly, but rather 
use the objects of type TCPSNat, TCPDNat, and TCPSDNat 
defined below. 
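
The coupling between sequence and ACK rewriting described above can be illustrated with a short sketch. It is not the implementation described here: TcpFields, shift32, rewriteSource, and rewriteDestination are hypothetical names, the fields are assumed to be in host byte order, and the only property relied upon is that TCP sequence and ACK numbers are 32-bit values that wrap modulo 2^32.

```cpp
#include <cstdint>

// Hypothetical minimal view of the two TCP header fields involved
// (host byte order); these are not the types used by the NAT classes.
struct TcpFields {
    std::uint32_t seq;
    std::uint32_t ack;
};

// Apply a signed relative offset with 32-bit wraparound. Converting a
// negative long to an unsigned type is modular, so subtraction also works.
inline std::uint32_t shift32(std::uint32_t value, long offset) {
    return value + static_cast<std::uint32_t>(offset);
}

// Source-rewriting direction: the sequence number field is adjusted.
inline void rewriteSource(TcpFields& t, long seqoff) {
    t.seq = shift32(t.seq, seqoff);
}

// Destination-rewriting direction: the ACK field is adjusted instead.
inline void rewriteDestination(TcpFields& t, long ackoff) {
    t.ack = shift32(t.ack, ackoff);
}
```

In such a scheme, a stream whose outbound sequence numbers are shifted by some offset would typically have the ACK numbers of the returning segments shifted by the opposite offset, so that each endpoint sees a consistent numbering.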

Constructor

Description

The constructor is given a pair of values indicating whether port
number rewriting and sequence/ACK number rewriting are enabled.

Syntax

TCPNat (bool doports, bool doseqs);

Parameters

Parameter Type Description

doports bool Boolean value indicating whether port number
rewriting is enabled. A true value indicates port
number rewriting is enabled.
doseqs bool Boolean value indicating whether sequence/ACK
number rewriting is enabled. A true value indicates
sequence/ACK number rewriting is enabled.

Return value

None.

rewrite

Description

This pure virtual function is defined in derived classes. It
performs address rewriting in a specific fashion implemented
by the specific derived classes (i.e., source,
destination, or source/destination combination). The rewrite
call, as applied to a fragment, only affects the given fragment.
When applied to a datagram, each of the fragment
headers comprising the datagram is rewritten.

Syntax

virtual void rewrite(IP4Datagram* dp)=0;
virtual void rewrite(IP4Fragment* fp)=0;

Parameters

Parameter Type Description

dp IP4Datagram * Pointer to the datagram to rewrite
fp IP4Fragment * Pointer to the single fragment to rewrite

Return value

None.

ports

Description

The first form of this function returns true if the NAT
object is configured to rewrite port numbers. The second
form of this function configures the object to enable or
disable port number rewriting using the values true and
false, respectively.

Syntax

bool ports();
void ports(bool p);

Parameters

Parameter Type Description

p bool Boolean indicating whether port number rewriting is
enabled.

Return value

The first form of this function returns true if the NAT
object is configured to rewrite TCP port numbers.



seqs

Description

The first form of this function returns true if the NAT
object is configured to rewrite sequence/ACK numbers. The
second form of this function configures the object to enable
or disable sequence/ACK number rewriting using the values
true and false, respectively.

Syntax

bool seqs();
void seqs(bool s);

Parameters

Parameter Type Description

s bool Boolean indicating whether sequence/ACK number
rewriting is enabled.

Return value

The first form of this function returns true if the NAT
object is configured to rewrite sequence/ACK numbers.
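
Taken together, the constructor, rewrite, ports, and seqs members suggest an abstract base class of roughly the following shape. This is a sketch under stated assumptions rather than the actual class: TCPNatSketch and CountingNat are hypothetical names, the packet types are only forward-declared, and the accessors are marked const here although the syntax above does not say so.

```cpp
// Forward declarations stand in for the packet types named in the text.
class IP4Datagram;
class IP4Fragment;

// Hypothetical stand-in for the TCPNat base class.
class TCPNatSketch {
public:
    TCPNatSketch(bool doports, bool doseqs) : ports_(doports), seqs_(doseqs) {}
    virtual ~TCPNatSketch() = default;

    // Pure virtual: each derived class supplies its own rewriting policy.
    virtual void rewrite(IP4Datagram* dp) = 0;
    virtual void rewrite(IP4Fragment* fp) = 0;

    // First form queries the setting, second form changes it.
    bool ports() const { return ports_; }
    void ports(bool p) { ports_ = p; }
    bool seqs() const { return seqs_; }
    void seqs(bool s) { seqs_ = s; }

private:
    bool ports_;  // port number rewriting enabled
    bool seqs_;   // sequence/ACK number rewriting enabled
};

// Hypothetical derived class used only to show that the base class cannot
// be instantiated directly; it counts calls instead of rewriting packets.
class CountingNat : public TCPNatSketch {
public:
    using TCPNatSketch::TCPNatSketch;
    void rewrite(IP4Datagram*) override { ++datagrams; }
    void rewrite(IP4Fragment*) override { ++fragments; }
    int datagrams = 0;
    int fragments = 0;
};
```

Because both rewrite overloads are pure virtual, an application can only instantiate a derived class, which matches the statement that objects of type TCPNat are not created directly.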

29. TCPSNat class 
The TCPSNat class is derived from the TCPNat class. It
defines the class of objects implementing source address and
(optionally) port number and sequence number rewriting for
complete and fragmented TCP segments.

Constructors

Description

The single-argument constructor is used to create TCP
NAT objects that rewrite only the addresses in the IP header
(and update the IP header checksum and TCP pseudo-header
checksum appropriately). The two-argument constructor is
used to create NAT objects that also rewrite the source port
number in the TCP header. The three-argument constructor
is used to rewrite the IP address, source port number, and to
modify the TCP sequence number by a relative (constant)
amount. The sequence offset provided may be positive or
negative.
Syntax

TCPSNat (IP4Addr* newsaddr);
TCPSNat (IP4Addr* newsaddr, nuint16 newsport);
TCPSNat (IP4Addr* newsaddr, nuint16 newsport, long
seqoff);



Parameters

Parameter Type Description

newsaddr IP4Addr * Pointer to the new source address to be used
newsport nuint16 The new source port number to be used
seqoff long Relative change to make to TCP sequence number
fields. A positive value indicates the TCP
sequence number is increased by the amount
specified. A negative value indicates the
sequence number is reduced by the amount
specified.



Return value 



None. 
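
The checksum updates mentioned for these constructors need not recompute the checksum over the whole header. One common approach, shown here only as an illustrative sketch and not as the method used by these classes, is the incremental update of RFC 1624, HC' = ~(~HC + ~m + m'), applied to each 16-bit half of the replaced address. The function names are hypothetical and all values are assumed to be in host byte order; fullChecksum is included only for comparison.

```cpp
#include <cstddef>
#include <cstdint>

// Fold a 32-bit accumulator into a 16-bit one's-complement sum.
inline std::uint16_t fold(std::uint32_t sum) {
    while (sum >> 16) sum = (sum & 0xFFFFu) + (sum >> 16);
    return static_cast<std::uint16_t>(sum);
}

// Reference: checksum computed from scratch over n 16-bit words.
inline std::uint16_t fullChecksum(const std::uint16_t* words, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += words[i];
    return static_cast<std::uint16_t>(~fold(sum));
}

// RFC 1624 incremental update for one changed 16-bit word:
// HC' = ~(~HC + ~m + m').
inline std::uint16_t updateChecksum16(std::uint16_t hc,
                                      std::uint16_t oldWord,
                                      std::uint16_t newWord) {
    std::uint32_t sum = static_cast<std::uint16_t>(~hc);
    sum += static_cast<std::uint16_t>(~oldWord);
    sum += newWord;
    return static_cast<std::uint16_t>(~fold(sum));
}

// Apply the update for a replaced 32-bit IPv4 address, one half at a time.
inline std::uint16_t updateChecksumForAddress(std::uint16_t hc,
                                              std::uint32_t oldAddr,
                                              std::uint32_t newAddr) {
    hc = updateChecksum16(hc, static_cast<std::uint16_t>(oldAddr >> 16),
                          static_cast<std::uint16_t>(newAddr >> 16));
    return updateChecksum16(hc, static_cast<std::uint16_t>(oldAddr & 0xFFFFu),
                            static_cast<std::uint16_t>(newAddr & 0xFFFFu));
}
```

The same update can be applied to the TCP checksum, since the source address is part of the TCP pseudo-header.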



rewrite

Description

Defines the pure virtual rewrite functions in the parent
class.

Syntax

void rewrite(IP4Datagram* dp);
void rewrite(IP4Fragment* fp);

Parameters

Parameter Type Description

dp IP4Datagram * Pointer to the datagram to be rewritten (all
fragment headers are modified)
fp IP4Fragment * Pointer to the fragment to rewrite (only the
single fragment header is modified). Should
only be called when the fragment represents
a complete TCP/IP segment.

Return value

None.

30. TCPSDNat class

The TCPSDNat class is derived from the TCPNat class. It
defines the class of objects implementing source and destination
address and (optionally) port number and sequence number/ACK
number rewriting for complete and fragmented TCP segments.

Constructors

Description

The two-argument constructor is used to create TCP NAT
objects that rewrite only the addresses in the IP header (and
update the IP header checksum and TCP pseudo-header
checksum appropriately). The four-argument constructor is
used to create NAT objects that also rewrite the source and
destination port numbers in the TCP header. The six-argument
constructor is used to rewrite the IP addresses and the
source and destination port numbers, and to modify the TCP
sequence and ACK numbers by relative (constant) amounts.
The sequence and ACK offsets provided may be positive or
negative.

Syntax

TCPSDNat (IP4Addr* newsaddr, IP4Addr* newdaddr);
TCPSDNat (IP4Addr* newsaddr, nuint16 newsport,
IP4Addr* newdaddr, nuint16 newdport);
TCPSDNat (IP4Addr* newsaddr, nuint16 newsport, long
seqoff, IP4Addr* newdaddr, nuint16 newdport, long ackoff);

Parameters

Parameter Type Description

newsaddr IP4Addr * Pointer to the new source address to be used
newsport nuint16 The new source port number to be used
seqoff long Relative change to make to TCP sequence number
fields. A positive value indicates the TCP
sequence number is increased by the amount
specified. A negative value indicates the
sequence number is reduced by the amount
specified.
newdaddr IP4Addr * Pointer to the new destination address to be used
newdport nuint16 The new destination port number to be used
ackoff long Relative change to make to TCP ACK number
fields. A positive value indicates the TCP ACK
number is increased by the amount specified. A
negative value indicates the ACK number is
reduced by the amount specified.

Return value

None.

rewrite

Description

Defines the pure virtual rewrite functions in the parent
class.

Syntax

void rewrite(IP4Datagram* dp);
void rewrite(IP4Fragment* fp);

Parameters

Parameter Type Description

dp IP4Datagram * Pointer to the datagram to be rewritten (all
fragment headers are modified)
fp IP4Fragment * Pointer to the fragment to rewrite (only the
single fragment header is modified). Should
only be called when the fragment represents
a complete TCP/IP segment.

Return value

None.

Those skilled in the art will appreciate variations of the 
above described embodiments. In addition to these 
embodiments, other variations will be appreciated by those 
skilled in the art. As such, the scope of the invention is not
limited to the specified embodiments, but is defined by the 
following claims. 
What is claimed is:
1. A system comprising:

an application processor having a host interface; 

a bus bridge coupled to said host interface; and 

a policy engine coupled to said bus bridge to classify 
packets according to at least one of a plurality of 
application-specific classification policies and to enable 
at least one of a plurality of actions to be performed on 
said classified packets responsive to said packet clas- 
sifications. 

2. The system of claim 1, wherein said policy engine 
comprises: 

at least one media access controller to receive and/or
transmit said packets; and

at least one classification engine to classify said packets.

3. The system of claim 2, wherein said classification 
engine comprises a microprogrammed processor to acceler- 
ate predicate analysis in network infrastructure applications. 

4. The system of claim 3, wherein said microprogrammed 
processor selectively processes each of said packets by 
performing thereon at least a subset of packet-based opera- 
tions including: 

packet header parsing, and 
hash-table lookups. 

5. The system of claim 2, wherein said at least one of a 
plurality of actions comprise causing at least one of said 
classified packets to be returned to said classification engine 
to be reclassified. 

6. The system of claim 1, further comprising a plurality of 
data buffers to store said packets and classification informa- 
tion about said packets. 

7. The system of claim 6, wherein said policy engine 
further comprises a plurality of producer/consumer ring 
arrays to store at least one pointer to at least one of said 
plurality of data buffers. 

8. The system of claim 1, wherein said classification 
policies are dynamically supplied by an application execut- 
ing on said application processor. 

9. The system of claim 1, wherein said policy engine 
further comprises at least one other policy engine coupled to 
said bus bridge. 

10. In a network switching device, a policy engine com- 
prising: 

at least one media access controller to receive and/or 
transmit packets between said policy engine and a host; 

at least one classification engine to classify said packets 
according to at least one of a plurality of application- 
specific classification policies; and 

a plurality of data buffers coupled to said at least one 
media controller and said at least one classification 
engine to store said packets and facilitate said classi- 
fications. 

11. The policy engine of claim 10, wherein said classifi- 
cation engine comprises a microprogrammed processor to 
accelerate predicate analysis in network infrastructure appli- 
cations. 

12. The policy engine of claim 11, wherein said micro- 
programmed processor selectively processes each of said 
packets by performing at least a subset of a plurality of 
packet-based operations including: 

packet header parsing, and 
hash-table lookups.

13. The policy engine of claim 10, wherein said 
application-specific classification policies are provided by 
said host. 

14. The policy engine of claim 10, further comprising an 
embedded processor to provide said application-specific 
classification policies. 
15. The policy engine of claim 10, further comprising a
plurality of producer/consumer ring arrays to store at least
one pointer to at least one of said plurality of data buffers.

16. The policy engine of claim 10, wherein said classification
packets are returned to classification engine to be
reclassified.

17. An article of manufacture comprising: 

a recordable medium having recorded thereon a plurality
of programing instructions for use to program a computing
device to classify packets stored in at least one
of a plurality of data buffers according to at least one of
a plurality of application-specific classification packets.

18. The article of manufacture of claim 17, wherein said
programing instructions for use to program said computing
device to classify said packets further comprise programing
instructions to enable said computing device to:

retrieve from one of a plurality of ring arrays a packet
buffer pointer corresponding to a packet to be classified;

retrieve at least a portion of said packet to be classified
indicated by said packet buffer pointer;

execute a plurality of application-specific operations to
extract information from said packet; and

store results of said operations in a designated result area
of said packet.

19. The article of manufacture of claim 18, wherein said
programming instructions for use to program said computing
device to classify said packets further comprise programing
instructions to enable said computing device to:

generate a hash key from said extracted information and
utilize said hash key to perform a lookup in a hash table
to identify a record associated with at least one packet
matching said key.

20. The article of manufacture of claim 17, wherein said
application-specific classification policies are dynamically
provided by a host processor.

21. The article of manufacture of claim 17, wherein said 
application-specific classification policies are dynamically 
provided by said computing device. 

22. A method comprising: 

storing packets in at least one of a plurality of data buffers;
and

classifying said packets stored in said at least one of a
plurality of data buffers according to at least one of a
plurality of application-specific classification policies.

23. The method of claim 22, wherein said classifying
packets comprises:

retrieving from one of a plurality of producer/consumer
ring arrays a packet buffer pointer corresponding to a
packet to be classified;

retrieving at least a portion of said packet to be classified
indicated by said packet buffer pointer;

executing a plurality of application-specific operations to
extract information from said packet; and

storing results of said operations in a designated result
area of said packet.

24. The method of claim 23, further comprising:

generating a hash key from said extracted information and
utilizing said hash key to perform a lookup in a hash
table to identify a record associated with at least one
packet matching said key.

25. The method of claim 22, wherein said application-specific
classification policies are dynamically provided by
a host processor.

26. The method of claim 22, wherein said application-specific
classification policies are dynamically provided by
an embedded processor.
